[dpdk-dev] [RFC 00/35] mempool: rework memory allocation

Olivier Matz olivier.matz at 6wind.com
Wed Mar 9 17:19:06 CET 2016


This series is a rework of mempool. For those who don't want to read
all the cover letter, here is a summary:

- it is not possible to allocate a large mempool if there is not enough
  contiguous memory; this series solves that issue
- introduce new APIs with fewer arguments: "create, populate, obj_init"
- make it possible to free a mempool
- split the code into smaller functions, which will ease the
  introduction of ext_handler
- remove the test-pmd anonymous mempool creation
- remove most of the dom0-specific mempool code
- open the door for an eal_memory rework: we probably don't need a
  large contiguous memory area anymore; working with pages would work.

This will clearly break the ABI, but as there are already two other changes
that will break it for 16.07, the target for this series is 16.07. I plan to
send a deprecation notice for 16.04 soon.

The API stays almost the same: no modification is needed in the example apps
or in test-pmd. Only the kni and mellanox drivers are slightly modified.

Description of the initial issue
--------------------------------

The allocation of an mbuf pool can fail even if there is enough memory.
The problem is related to the way memory is allocated and used in
dpdk. It is particularly annoying with mbuf pools, but it can also occur
in other use cases that allocate a large amount of memory.

- rte_malloc() allocates physically contiguous memory, which is needed
  for mempools, but useless most of the time.

  Allocating a large physically contiguous zone is often impossible
  because the system provides hugepages which may not be contiguous.

- rte_mempool_create() (and therefore rte_pktmbuf_pool_create())
  requires a physically contiguous zone.

- rte_mempool_xmem_create() does not solve the issue as it still
  needs the memory to be virtually contiguous, and there is no
  way in dpdk to allocate a virtually contiguous memory area that is
  not also physically contiguous.
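
For illustration, here is a minimal snippet (assuming the EAL is
already initialized) that can hit the issue; the pool name, size and
cache size are arbitrary:

  #include <stdio.h>
  #include <rte_errno.h>
  #include <rte_mbuf.h>

  static void create_big_pool(void)
  {
      struct rte_mempool *mp;

      /* ~200k mbufs of default size: even if enough hugepage memory
       * is available, this can fail with ENOMEM when no zone is both
       * physically and virtually contiguous and large enough to
       * store all the objects */
      mp = rte_pktmbuf_pool_create("big_pool", 200000,
          250 /* cache size */, 0 /* priv size */,
          RTE_MBUF_DEFAULT_BUF_SIZE, SOCKET_ID_ANY);
      if (mp == NULL)
          printf("pool creation failed: %s\n",
              rte_strerror(rte_errno));
  }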

How to reproduce the issue
--------------------------

- start dpdk with some 2MB hugepages (the issue can also occur with 1GB pages)
- allocate a large mempool
- even if there is enough memory, the allocation can fail

Example:

  git clone http://dpdk.org/git/dpdk
  cd dpdk
  make config T=x86_64-native-linuxapp-gcc
  make -j32
  mkdir -p /mnt/huge
  mount -t hugetlbfs nodev /mnt/huge
  echo 256 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages

  # we try to allocate a mempool whose size is ~450MB, it fails
  ./build/app/testpmd -l 2,4 -- --total-num-mbufs=200000 -i

The EAL logs "EAL: Virtual area found at..." show that there are
several zones, all smaller than 450MB.

Workarounds:

- Use 1GB hugepages: it sometimes works, but for very large
  pools (millions of mbufs) the same issue arises. Moreover,
  it consumes at least 1GB of memory, which can be a lot
  in some cases.

- Reboot the machine or allocate hugepages at boot time: this increases
  the chances of having more contiguous memory, but does not completely
  solve the issue.

Solutions
---------

Below is a list of proposed solutions. I implemented a quick and dirty
PoC of solution 1, but it does not work in all conditions and it is
really an ugly hack. This series implements solution 4, which looks
the best to me, knowing it does not prevent further enhancements
to dpdk memory in the future (solution 3 for instance).

Solution 1: in application
--------------------------

- allocate several hugepages using rte_malloc() or rte_memzone_reserve()
  (only keeping complete hugepages)
- parse the memsegs and /proc/maps to check which files mmap these pages
- mmap the files in a contiguous virtual area
- use rte_mempool_xmem_create() (a sketch of this last step is shown
  below)
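
A sketch of that last step, for illustration only: vaddr, paddr[] and
pg_num are assumed to come from the previous steps (re-mmapping the
rte_map files in one contiguous virtual area and resolving the
physical address of each page), and the pool parameters are arbitrary:

  #include <rte_mbuf.h>
  #include <rte_mempool.h>

  /* vaddr, paddr[] and pg_num come from the previous steps;
   * pg_shift is 21 for 2MB pages */
  static struct rte_mempool *
  pool_from_xmem(char *vaddr, const phys_addr_t paddr[], uint32_t pg_num)
  {
      unsigned elt_size = sizeof(struct rte_mbuf) +
          RTE_MBUF_DEFAULT_BUF_SIZE;

      /* mimic rte_pktmbuf_pool_create() by hand, since it cannot be
       * used with externally provided memory (see con 1c below) */
      return rte_mempool_xmem_create("app_pool", 200000, elt_size,
          250, sizeof(struct rte_pktmbuf_pool_private),
          rte_pktmbuf_pool_init, NULL, rte_pktmbuf_init, NULL,
          SOCKET_ID_ANY, 0, vaddr, paddr, pg_num, 21);
  }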

Cons:

- 1a. parsing the memsegs of the rte config in the application does not
  use a public API, and can break if internal dpdk code changes
- 1b. some memory is lost due to malloc headers. Also, if the memory is
  very fragmented (ex: all 2MB pages are physically separated), it does
  not work at all because we cannot get any complete page. It is not
  possible to use a lower-level allocator since commit fafcc11985a.
- 1c. we cannot use rte_pktmbuf_pool_create(), so we need to use the
  mempool api and do part of the job manually
- 1d. it breaks secondary processes as the virtual addresses won't be
  mmap'd at the same place in the secondary process
- 1e. it only fixes the issue for the mbuf pool of the application;
  internal pools in dpdk libraries are not modified
- 1f. this is a pure linux solution (rte_map files)
- 1g. the application has to be aware of the RTE_EAL_SINGLE_SEGMENTS
  option that changes the way hugepages are mapped. By the way, it's
  strange to have such a compile-time option; we should probably have
  only one behavior that works all the time.

Solution 2: in dpdk memory allocator
------------------------------------

- do the same as solution 1 in a new function rte_malloc_non_contig():
  allocate several chunks and mmap them in a contiguous virtual memory
  area
- a flag has to be added in the malloc header to do the proper cleanup
  in rte_free() (free all the chunks, munmap the memory)
- introduce a new rte_mem_get_physmap(*physmap, addr, len) that returns
  the virt2phys mapping of a virtual area in dpdk
- add a mempool flag MEMPOOL_F_NON_PHYS_CONTIG to use
  rte_malloc_non_contig() to allocate the area storing the objects
  (hedged prototypes for these functions are sketched below)
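
These functions do not exist today; below are hypothetical prototypes
matching the proposal above (names and signatures are TBD):

  #include <rte_memory.h>

  /* hypothetical: allocate len bytes that are virtually contiguous
   * but possibly made of several physically separated chunks */
  void *rte_malloc_non_contig(const char *type, size_t len,
      unsigned align);

  /* hypothetical: fill physmap[] with the physical address of each
   * page of the virtual area [addr, addr + len); this API would
   * also serve solution 3 below */
  int rte_mem_get_physmap(phys_addr_t *physmap, void *addr,
      size_t len);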

Cons:

- 2a. same as 1d: it breaks secondary processes if the mempool flag is
  used.
- 2b. same as 1b: some memory is lost due to malloc headers, and it
  cannot work if memory is too fragmented.
- 2c. rte_malloc_virt2phy() cannot be used on these zones: it would
  return the physical address of the first page only. It would be
  better to return an error in this case.
- 2d. need to check how to implement this on bsd (TBD)

Solution 3: in dpdk eal memory
------------------------------

- Rework the way hugepages are mmap'd in dpdk: instead of having several
  rte_map* files, just mmap one file per node. It may drastically
  simplify EAL memory management in dpdk.
- An API should be added to retrieve the physical mapping of a virtual
  area (ex: rte_mem_get_physmap(*physmap, addr, len))
- rte_malloc() and rte_memzone_reserve() won't allocate physically
  contiguous memory anymore (TBD)
- Update mempool to always use the rte_mempool_xmem_create() version

Cons:

- 3a. a lot of rework in eal memory; it will induce some behavior
  changes and maybe API changes
- 3b. possible conflicts with the xen_dom0 mempool

Solution 4: in mempool
----------------------

- Introduce a new API to fill a mempool with zones that are not
  virtually contiguous. It requires adding new functions to create and
  populate a mempool. Example (TBD):

  - rte_mempool_create_empty(name, n, elt_size, cache_size, priv_size)
  - rte_mempool_populate(mp, addr, len): add virtual memory for objects
  - rte_mempool_obj_iter(mp, obj_cb, arg): call a cb for each object

- update rte_mempool_create() to allocate objects in several memory
  chunks by default if there is no large enough physically contiguous
  memory (a usage sketch of the new functions is shown below).
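
A usage sketch with the proposed names (all still TBD; elt_size, the
chunk address addr/len and the object callback obj_init_cb are
placeholders):

  struct rte_mempool *mp;

  /* create an empty pool: no memory is allocated for objects yet */
  mp = rte_mempool_create_empty("pool", 200000, elt_size,
      250 /* cache_size */, 0 /* priv_size */);
  if (mp == NULL)
      rte_panic("cannot create empty mempool\n");

  /* populate it with a virtual memory chunk; can be called several
   * times, once per chunk, until the pool is fully populated */
  rte_mempool_populate(mp, addr, len);

  /* finally, initialize each object of the pool */
  rte_mempool_obj_iter(mp, obj_init_cb, obj_init_arg);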

Tests done
----------

Compilation
~~~~~~~~~~~

The following targets:

 x86_64-native-linuxapp-gcc
 i686-native-linuxapp-gcc
 x86_x32-native-linuxapp-gcc
 x86_64-native-linuxapp-clang

Libraries with and without debug, in static and shared mode + examples.

autotests
~~~~~~~~~

cd /root/dpdk.org
make config T=x86_64-native-linuxapp-gcc O=x86_64-native-linuxapp-gcc
make -j4 O=x86_64-native-linuxapp-gcc EXTRA_CFLAGS="-g -O0"
modprobe uio_pci_generic
python tools/dpdk_nic_bind.py -b uio_pci_generic 0000:03:00.0
python tools/dpdk_nic_bind.py -b uio_pci_generic 0000:08:00.0
mkdir -p /mnt/huge
mount -t hugetlbfs nodev /mnt/huge
echo 256 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages

./x86_64-native-linuxapp-gcc/app/test -l 0,2,4 -n 4
  memory_autotest       OK
  memzone_autotest      OK
  ring_autotest         OK
  ring_perf_autotest    OK
  mempool_autotest      OK
  mempool_perf_autotest OK

same with --no-huge
  memory_autotest       OK
  memzone_autotest      KO  (was already KO)
  mempool_autotest      OK
  mempool_perf_autotest OK
  ring_autotest         OK
  ring_perf_autotest    OK

test-pmd
~~~~~~~~

# now starts fine, was failing before when memory was too fragmented
./x86_64-native-linuxapp-gcc/app/testpmd -l 0,2,4 -n 4 -- -i --port-topology=chained

# still ok
./x86_64-native-linuxapp-gcc/app/testpmd -l 0,2,4 -n 4 -m 256 -- -i --port-topology=chained --mp-anon
set fwd txonly
start
stop

# fails, but was failing before too. The problem is that the physical
# addresses are not properly set when using --no-huge. The mempool phys addrs
# are now correct, but the zones allocated through memzone_reserve() are
# still wrong. This could be fixed in a future series.
./x86_64-native-linuxapp-gcc/app/testpmd -l 0,2,4 -n 4 -m 256 --no-huge -- -i --port-topology=chained
set fwd txonly
start
stop

Tests not done (for now)
------------------------

- mellanox driver (it is slightly modified)
- kni (it is slightly modified)
- compilation and test with freebsd
- compilation and test with xen
- compilation and test on other archs


Olivier Matz (35):
  mempool: fix comments and style
  mempool: replace elt_size by total_elt_size
  mempool: uninline function to check cookies
  mempool: use sizeof to get the size of header and trailer
  mempool: rename mempool_obj_ctor_t as mempool_obj_cb_t
  mempool: update library version
  mempool: list objects when added in the mempool
  mempool: remove const attribute in mempool_walk
  mempool: use the list to iterate the mempool elements
  eal: introduce RTE_DECONST macro
  mempool: use the list to audit all elements
  mempool: use the list to initialize mempool objects
  mempool: create the internal ring in a specific function
  mempool: store physaddr in mempool objects
  mempool: remove MEMPOOL_IS_CONTIG()
  mempool: store memory chunks in a list
  mempool: new function to iterate the memory chunks
  mempool: simplify xmem_usage
  mempool: introduce a free callback for memory chunks
  mempool: make page size optional when getting xmem size
  mempool: default allocation in several memory chunks
  eal: lock memory when using no-huge
  mempool: support no-hugepage mode
  mempool: replace mempool physaddr by a memzone pointer
  mempool: introduce a function to free a mempool
  mempool: introduce a function to create an empty mempool
  eal/xen: return machine address without knowing memseg id
  mempool: rework support of xen dom0
  mempool: create the internal ring when populating
  mempool: populate a mempool with anonymous memory
  test-pmd: remove specific anon mempool code
  mempool: make mempool populate and free api public
  mem: avoid memzone/mempool/ring name truncation
  mempool: new flag when phys contig mem is not needed
  mempool: update copyright

 app/test-pmd/Makefile                        |    4 -
 app/test-pmd/mempool_anon.c                  |  201 -----
 app/test-pmd/mempool_osdep.h                 |   54 --
 app/test-pmd/testpmd.c                       |   17 +-
 app/test/test_mempool.c                      |   21 +-
 doc/guides/rel_notes/release_16_04.rst       |    2 +-
 drivers/net/mlx4/mlx4.c                      |   71 +-
 drivers/net/mlx5/mlx5_rxq.c                  |    9 +-
 drivers/net/mlx5/mlx5_rxtx.c                 |   62 +-
 drivers/net/mlx5/mlx5_rxtx.h                 |    2 +-
 drivers/net/xenvirt/rte_eth_xenvirt.h        |    2 +-
 drivers/net/xenvirt/rte_mempool_gntalloc.c   |    4 +-
 lib/librte_eal/common/eal_common_log.c       |    2 +-
 lib/librte_eal/common/eal_common_memzone.c   |   10 +-
 lib/librte_eal/common/include/rte_common.h   |    9 +
 lib/librte_eal/common/include/rte_memory.h   |   11 +-
 lib/librte_eal/linuxapp/eal/eal_memory.c     |    2 +-
 lib/librte_eal/linuxapp/eal/eal_xen_memory.c |   17 +-
 lib/librte_kni/rte_kni.c                     |   12 +-
 lib/librte_mempool/Makefile                  |    5 +-
 lib/librte_mempool/rte_dom0_mempool.c        |  133 ----
 lib/librte_mempool/rte_mempool.c             | 1031 +++++++++++++++++---------
 lib/librte_mempool/rte_mempool.h             |  590 +++++++--------
 lib/librte_mempool/rte_mempool_version.map   |   18 +-
 lib/librte_ring/rte_ring.c                   |   16 +-
 25 files changed, 1121 insertions(+), 1184 deletions(-)
 delete mode 100644 app/test-pmd/mempool_anon.c
 delete mode 100644 app/test-pmd/mempool_osdep.h
 delete mode 100644 lib/librte_mempool/rte_dom0_mempool.c

-- 
2.1.4


