[dpdk-dev] [PATCH] mem: balanced allocation of hugepages

Ilya Maximets i.maximets at samsung.com
Thu Mar 9 13:57:24 CET 2017


On 08.03.2017 16:46, Sergio Gonzalez Monroy wrote:
> Hi Ilya,
> 
> I have done similar tests and as you already pointed out, 'numactl --interleave' does not seem to work as expected.
> I have also checked that the issue can be reproduced with quota limit on hugetlbfs mount point.
> 
> I would be inclined towards *adding libnuma as dependency* to DPDK to make memory allocation a bit more reliable.
> 
> Currently at a high level regarding hugepages per numa node:
> 1) Try to map all free hugepages. The total number of mapped hugepages depends on whether there are any limits, such as cgroups or a quota on the hugetlbfs mount point.
> 2) Find out numa node of each hugepage.
> 3) Check if we have enough hugepages for requested memory in each numa socket/node.
> 
> Using libnuma we could try to allocate hugepages per numa:
> 1) Try to map as many hugepages as possible from numa 0.
> 2) Check if we have enough hugepages for requested memory in numa 0.
> 3) Try to map as many hugepages as possible from numa 1.
> 4) Check if we have enough hugepages for requested memory in numa 1.
> 
> This approach would improve failing scenarios caused by limits, but it would still not fix issues regarding non-contiguous hugepages (in the worst case each hugepage is a memseg).
> The non-contiguous hugepages issues are not as critical now that mempools can span over multiple memsegs/hugepages, but it is still a problem for any other library requiring big chunks of memory.
> 
> Potentially if we were to add an option such as 'iommu-only' when all devices are bound to vfio-pci, we could have a reliable way to allocate hugepages by just requesting the number of pages from each numa.
> 
> Thoughts?

Hi Sergio,

Thanks for your attention to this.

For now, as we have some issues with non-contiguous
hugepages, I'm thinking about the following hybrid scheme:
1) Allocate essential hugepages:
	1.1) Allocate only as many hugepages from numa N as
	     are needed to fit the requested memory for that node.
	1.2) Repeat 1.1 for all numa nodes.
2) Try to map all remaining free hugepages in a round-robin
   fashion like in this patch.
3) Sort pages and choose the most suitable.

This solution should decrease the number of issues connected
with non-contiguous memory.

Best regards, Ilya Maximets.

> 
> On 06/03/2017 09:34, Ilya Maximets wrote:
>> Hi all.
>>
>> So, what about this change?
>>
>> Best regards, Ilya Maximets.
>>
>> On 16.02.2017 16:01, Ilya Maximets wrote:
>>> Currently EAL allocates hugepages one by one without paying
>>> attention to which NUMA node the allocation comes from.
>>>
>>> Such behaviour leads to allocation failures if the number of
>>> hugepages available to the application is limited by cgroups
>>> or hugetlbfs and memory is requested from more than just the
>>> first socket.
>>>
>>> Example:
>>>     # 90 x 1GB hugepages available in a system
>>>
>>>     cgcreate -g hugetlb:/test
>>>     # Limit to 32GB of hugepages
>>>     cgset -r hugetlb.1GB.limit_in_bytes=34359738368 test
>>>     # Request 4GB from each of 2 sockets
>>>     cgexec -g hugetlb:test testpmd --socket-mem=4096,4096 ...
>>>
>>>     EAL: SIGBUS: Cannot mmap more hugepages of size 1024 MB
>>>     EAL: 32 not 90 hugepages of size 1024 MB allocated
>>>     EAL: Not enough memory available on socket 1!
>>>          Requested: 4096MB, available: 0MB
>>>     PANIC in rte_eal_init():
>>>     Cannot init memory
>>>
>>>     This happens because all allocated pages are
>>>     on socket 0.
>>>
>>> Fix this issue by setting mempolicy MPOL_PREFERRED for each
>>> hugepage to one of requested nodes in a round-robin fashion.
>>> In this case all allocated pages will be fairly distributed
>>> between all requested nodes.
>>>
>>> New config option RTE_LIBRTE_EAL_NUMA_AWARE_HUGEPAGES
>>> introduced and disabled by default because of external
>>> dependency from libnuma.
>>>
>>> Cc: <stable at dpdk.org>
>>> Fixes: 77988fc08dc5 ("mem: fix allocating all free hugepages")
>>>
>>> Signed-off-by: Ilya Maximets <i.maximets at samsung.com>
>>> ---
>>>   config/common_base                       |  1 +
>>>   lib/librte_eal/Makefile                  |  4 ++
>>>   lib/librte_eal/linuxapp/eal/eal_memory.c | 66 ++++++++++++++++++++++++++++++++
>>>   mk/rte.app.mk                            |  3 ++
>>>   4 files changed, 74 insertions(+)
>>>

