[dpdk-dev] [PATCH] mem: balanced allocation of hugepages

Ilya Maximets i.maximets at samsung.com
Mon Apr 10 10:05:56 CEST 2017



On 10.04.2017 10:51, Sergio Gonzalez Monroy wrote:
> On 10/04/2017 08:11, Ilya Maximets wrote:
>> On 07.04.2017 18:44, Thomas Monjalon wrote:
>>> 2017-04-07 18:14, Ilya Maximets:
>>>> Hi All.
>>>>
>>>> I wanted to ask (just to clarify the current status):
>>>> Will this patch be included in the current release (acked by the
>>>> maintainer), with me upgrading it to the hybrid logic afterwards,
>>>> or should I just prepare a v3 with the hybrid logic for 17.08?
>>> What is your preferred option, Ilya?
>> I have no strong opinion on this. One thought is that it would be
>> nice if someone else could test this functionality with the current
>> release before enabling it by default in 17.08.
>>
>> Tomorrow I'm going on vacation. So I'll post a rebased version today
>> (the patch applies with a few fuzz offsets against current master),
>> and you and Sergio can decide what to do.
>>
>> Best regards, Ilya Maximets.
>>
>>> Sergio?
> 
> I would be inclined towards v3 targeting v17.08. IMHO it would be cleaner this way.

OK.
I've sent a rebased version just in case.

> 
> Sergio
> 
>>>
>>>> On 27.03.2017 17:43, Ilya Maximets wrote:
>>>>> On 27.03.2017 16:01, Sergio Gonzalez Monroy wrote:
>>>>>> On 09/03/2017 12:57, Ilya Maximets wrote:
>>>>>>> On 08.03.2017 16:46, Sergio Gonzalez Monroy wrote:
>>>>>>>> Hi Ilya,
>>>>>>>>
>>>>>>>> I have done similar tests and, as you already pointed out, 'numactl --interleave' does not seem to work as expected.
>>>>>>>> I have also checked that the issue can be reproduced with a quota limit on the hugetlbfs mount point.
>>>>>>>>
>>>>>>>> I would be inclined towards *adding libnuma as a dependency* to DPDK to make memory allocation a bit more reliable.
>>>>>>>>
>>>>>>>> Currently, at a high level, hugepages per NUMA node are handled as follows:
>>>>>>>> 1) Try to map all free hugepages. The total number of mapped hugepages depends on whether there are any limits, such as cgroups or a quota on the mount point.
>>>>>>>> 2) Find out the NUMA node of each hugepage.
>>>>>>>> 3) Check that we have enough hugepages for the requested memory on each NUMA socket/node.
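>>>>>>>>
>>>>>>>> For illustration, a minimal sketch of step 2 (not the actual EAL
>>>>>>>> code, which parses /proc/self/numa_maps): move_pages(2) with a
>>>>>>>> NULL nodes array reports the node backing an already-mapped page.
>>>>>>>> Links with -lnuma.
>>>>>>>>
>>>>>>>>     #include <numaif.h>
>>>>>>>>
>>>>>>>>     /* Return the NUMA node backing 'addr', or -1 on failure. */
>>>>>>>>     static int hugepage_node(void *addr)
>>>>>>>>     {
>>>>>>>>         int status = -1;
>>>>>>>>
>>>>>>>>         /* With nodes == NULL, move_pages() only queries placement. */
>>>>>>>>         if (move_pages(0, 1, &addr, NULL, &status, 0) != 0)
>>>>>>>>             return -1;
>>>>>>>>         return status; /* node id, or a negative errno per page */
>>>>>>>>     }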
>>>>>>>>
>>>>>>>> Using libnuma we could try to allocate hugepages per NUMA node:
>>>>>>>> 1) Try to map as many hugepages as possible from numa 0.
>>>>>>>> 2) Check whether we have enough hugepages for the requested memory on numa 0.
>>>>>>>> 3) Try to map as many hugepages as possible from numa 1.
>>>>>>>> 4) Check whether we have enough hugepages for the requested memory on numa 1.
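>>>>>>>>
>>>>>>>> Roughly, a sketch of that loop, assuming libnuma is available and
>>>>>>>> with map_hugepages_on_node() and requested_mem[] as hypothetical
>>>>>>>> stand-ins for the existing mapping code:
>>>>>>>>
>>>>>>>>     #include <numa.h>
>>>>>>>>
>>>>>>>>     if (numa_available() >= 0) {
>>>>>>>>         int node;
>>>>>>>>
>>>>>>>>         for (node = 0; node <= numa_max_node(); node++) {
>>>>>>>>             /* Prefer faulting new hugepages on this node... */
>>>>>>>>             numa_set_preferred(node);
>>>>>>>>             /* ...and map pages until this node's request is met. */
>>>>>>>>             map_hugepages_on_node(node, requested_mem[node]);
>>>>>>>>         }
>>>>>>>>         /* Restore the default allocation policy. */
>>>>>>>>         numa_set_localalloc();
>>>>>>>>     }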
>>>>>>>>
>>>>>>>> This approach would improve the failing scenarios caused by limits, but it would still not fix issues regarding non-contiguous hugepages (in the worst case, each hugepage is a memseg).
>>>>>>>> The non-contiguous hugepage issues are not as critical now that mempools can span multiple memsegs/hugepages, but they are still a problem for any other library requiring big chunks of memory.
>>>>>>>>
>>>>>>>> Potentially, if we were to add an option such as 'iommu-only' for when all devices are bound to vfio-pci, we could have a reliable way to allocate hugepages by just requesting the required number of pages from each NUMA node.
>>>>>>>>
>>>>>>>> Thoughts?
>>>>>>> Hi Sergio,
>>>>>>>
>>>>>>> Thanks for your attention to this.
>>>>>>>
>>>>>>> For now, as we have some issues with non-contiguous
>>>>>>> hugepages, I'm thinking about the following hybrid scheme:
>>>>>>> 1) Allocate essential hugepages:
>>>>>>>      1.1) Allocate only as many hugepages from numa N as
>>>>>>>           needed to fit the memory requested for this node.
>>>>>>>      1.2) Repeat 1.1 for all numa nodes.
>>>>>>> 2) Try to map all remaining free hugepages in a round-robin
>>>>>>>      fashion, as in this patch.
>>>>>>> 3) Sort the pages and choose the most suitable ones.
>>>>>>>
>>>>>>> This solution should decrease the number of issues connected
>>>>>>> with non-contiguous memory.
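>>>>>>>
>>>>>>> In outline (all names here are illustrative placeholders, not
>>>>>>> the final code):
>>>>>>>
>>>>>>>     /* 1) Essential pages: just enough on each node. */
>>>>>>>     for (node = 0; node < nb_nodes; node++)
>>>>>>>         alloc_hugepages_on_node(node, requested_mem[node]);
>>>>>>>     /* 2) Map whatever is left, round-robin across nodes. */
>>>>>>>     alloc_remaining_round_robin();
>>>>>>>     /* 3) Sort (e.g. by physical address), keep the best pages. */
>>>>>>>     qsort(pages, nb_pages, sizeof(pages[0]), cmp_physaddr);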
>>>>>> Sorry for the late reply; I was hoping for more comments from the community.
>>>>>>
>>>>>> IMHO this should be the default behavior, which means no config option and libnuma as an EAL dependency.
>>>>>> I think your proposal is good; could you consider implementing such an approach for the next release?
>>>>> Sure, I can implement this for 17.08 release.
>>>>>
>>>>>>>> On 06/03/2017 09:34, Ilya Maximets wrote:
>>>>>>>>> Hi all.
>>>>>>>>>
>>>>>>>>> So, what about this change?
>>>>>>>>>
>>>>>>>>> Best regards, Ilya Maximets.
>>>>>>>>>
>>>>>>>>> On 16.02.2017 16:01, Ilya Maximets wrote:
>>>>>>>>>> Currently, EAL allocates hugepages one by one, paying no
>>>>>>>>>> attention to which NUMA node the allocation comes from.
>>>>>>>>>>
>>>>>>>>>> Such behaviour leads to an allocation failure if the number
>>>>>>>>>> of hugepages available to the application is limited by
>>>>>>>>>> cgroups or hugetlbfs and memory is requested not only from
>>>>>>>>>> the first socket.
>>>>>>>>>>
>>>>>>>>>> Example:
>>>>>>>>>>       # 90 x 1GB hugepages available in a system
>>>>>>>>>>
>>>>>>>>>>       cgcreate -g hugetlb:/test
>>>>>>>>>>       # Limit to 32GB of hugepages
>>>>>>>>>>       cgset -r hugetlb.1GB.limit_in_bytes=34359738368 test
>>>>>>>>>>       # Request 4GB from each of 2 sockets
>>>>>>>>>>       cgexec -g hugetlb:test testpmd --socket-mem=4096,4096 ...
>>>>>>>>>>
>>>>>>>>>>       EAL: SIGBUS: Cannot mmap more hugepages of size 1024 MB
>>>>>>>>>>       EAL: 32 not 90 hugepages of size 1024 MB allocated
>>>>>>>>>>       EAL: Not enough memory available on socket 1!
>>>>>>>>>>            Requested: 4096MB, available: 0MB
>>>>>>>>>>       PANIC in rte_eal_init():
>>>>>>>>>>       Cannot init memory
>>>>>>>>>>
>>>>>>>>>>       This happens because all allocated pages are
>>>>>>>>>>       on socket 0.
>>>>>>>>>>
>>>>>>>>>> Fix this issue by setting the mempolicy MPOL_PREFERRED for
>>>>>>>>>> each hugepage to one of the requested nodes in a round-robin
>>>>>>>>>> fashion. In this case, all allocated pages will be fairly
>>>>>>>>>> distributed between all requested nodes.
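>>>>>>>>>>
>>>>>>>>>> In essence, a simplified sketch of the idea (not the patch
>>>>>>>>>> body itself; next_requested_node() is a hypothetical
>>>>>>>>>> round-robin helper):
>>>>>>>>>>
>>>>>>>>>>     #include <numaif.h>
>>>>>>>>>>     #include <sys/mman.h>
>>>>>>>>>>
>>>>>>>>>>     unsigned long mask = 1UL << next_requested_node();
>>>>>>>>>>
>>>>>>>>>>     /* Prefer the chosen node for the next hugepage fault. */
>>>>>>>>>>     if (set_mempolicy(MPOL_PREFERRED, &mask,
>>>>>>>>>>                       sizeof(mask) * 8) < 0)
>>>>>>>>>>         return -1;
>>>>>>>>>>     addr = mmap(NULL, hugepage_sz, PROT_READ | PROT_WRITE,
>>>>>>>>>>                 MAP_SHARED, fd, 0);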
>>>>>>>>>>
>>>>>>>>>> A new config option, RTE_LIBRTE_EAL_NUMA_AWARE_HUGEPAGES, is
>>>>>>>>>> introduced and disabled by default because of the external
>>>>>>>>>> dependency on libnuma.
>>>>>>>>>>
>>>>>>>>>> Cc: <stable at dpdk.org>
>>>>>>>>>> Fixes: 77988fc08dc5 ("mem: fix allocating all free hugepages")
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Ilya Maximets <i.maximets at samsung.com>
>>>>>>>>>> ---
>>>>>>>>>>     config/common_base                       |  1 +
>>>>>>>>>>     lib/librte_eal/Makefile                  |  4 ++
>>>>>>>>>>     lib/librte_eal/linuxapp/eal/eal_memory.c | 66 ++++++++++++++++++++++++++++++++
>>>>>>>>>>     mk/rte.app.mk                            |  3 ++
>>>>>>>>>>     4 files changed, 74 insertions(+)
>>>>>> Acked-by: Sergio Gonzalez Monroy <sergio.gonzalez.monroy at intel.com>
>>>>> Thanks.
>>>>>
>>>>> Best regards, Ilya Maximets.
>>>>>

