Bug 1256 - drivers/common/mlx5: mlx5_malloc() called on invalid socket ID when global MR cache is full and rte_extmem_* API is used
Status: UNCONFIRMED
Alias: None
Product: DPDK
Classification: Unclassified
Component: other
Version: 21.11
Hardware: x86 Linux
Importance: Normal normal
Target Milestone: ---
Assignee: dev
Reported: 2023-06-20 13:51 CEST by Marius-Cristian Baciu
Modified: 2023-07-06 11:06 CEST

Description Marius-Cristian Baciu 2023-06-20 13:51:13 CEST
Overview:
An attempt to allocate a new mlx5 MR entry when the global Btree cache is full ends up calling mlx5_malloc() with the EXTERNAL_HEAP_MIN_SOCKET_ID socket ID, even though no external-memory heap has been created; the memory is registered through the rte_extmem_* API instead.

Steps to reproduce:
- start a primary DPDK process on a NIC compatible with the mlx5_core driver;
- use rte_extmem_register() to register more than 512 pages of 4 KB each;
- use rte_dev_dma_map() to DMA-map each page;
- call rte_eth_tx_burst() with an mbuf whose external buffer lies in the last page of the registered memory (or any page above index 512), i.e. at a virtual address that will not be found in the global Btree cache.

Actual results:
mlx5_malloc() in mlx5_mr_create_primary() fails with "Unable to allocate memory for a new MR". From this point forward, packets never reach the other end.

Expected results:
MR entries should be successfully retrieved from backup or created when the cache becomes full; calling mlx5_malloc() on an external heap socket should not be possible when the rte_extmem_* API is used. As stated in the DPDK documentation [1], "Memory added this way will not be available for any regular DPDK allocators".

Build Date & Hardware:
20 Jun 2023 on Debian GNU/Linux 4.18.0

[1]: https://doc.dpdk.org/guides-21.11/prog_guide/env_abstraction_layer.html
Comment 1 Raslan Darawsheh 2023-07-06 10:28:30 CEST
This was fixed by this patch:

https://git.dpdk.org/dpdk/commit/?h=releases&id=147f6fb42bd7637b37a9180b0774275531c05f9b

Could you kindly confirm?
Comment 3 Marius-Cristian Baciu 2023-07-06 11:06:49 CEST
Hi,

Unfortunately that patch only targets a memory socket issue with the ASO mechanism. However, in my setup ASO is never an issue - I actually do not believe it is enabled.

To give a little more insight, the problem I am describing manifests on the data path:
- rte_eth_tx_burst();
- mlx5_tx_burst_*() is called;
- at some later point, in mr_lookup_caches(), mr_btree_lookup() returns UINT32_MAX because all 256 entries in the cache are occupied and the last memory registration did not find an empty slot;
- when mr_lookup_caches() fails, mlx5_mr_create() -> mlx5_mr_create_primary() is called;
- mlx5_malloc() at line 723 fails because it is called with an inappropriate socket ID: the socket ID of the memseg list associated with the external buffer (previously registered with rte_extmem_register()) is EXTERNAL_HEAP_MIN_SOCKET_ID, which has no valid heap associated with it from which memory could be allocated.
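
To make the lookup miss in the steps above concrete, here is a toy model in plain C. It is not the mlx5 driver code: every name in it (toy_cache, toy_cache_insert, toy_cache_lookup, toy_register_pages) is hypothetical, and the only things borrowed from this report are the 256-entry capacity, the 4 KB page size, and the UINT32_MAX miss value. It only sketches why an address in a page registered after the table filled up cannot be found.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CACHE_CAPACITY 256u  /* entry count quoted in this report */
#define TOY_PAGE_SIZE      4096u /* 4 KB pages, as in the reproduction steps */

/* One cached region: the half-open address range [start, end). */
struct toy_entry {
    uintptr_t start;
    uintptr_t end;
};

/* Hypothetical fixed-capacity cache; entries are kept in ascending order. */
struct toy_cache {
    struct toy_entry table[TOY_CACHE_CAPACITY];
    uint32_t len;
};

/* Append a region. Returns 0 on success, -1 if the table is already full.
 * Callers are assumed to insert regions in ascending address order. */
static int toy_cache_insert(struct toy_cache *c, uintptr_t start, uintptr_t end)
{
    assert(start < end);
    if (c->len == TOY_CACHE_CAPACITY)
        return -1;
    c->table[c->len].start = start;
    c->table[c->len].end = end;
    c->len++;
    return 0;
}

/* Binary search for the entry containing addr.
 * Returns its index, or UINT32_MAX when no entry covers addr. */
static uint32_t toy_cache_lookup(const struct toy_cache *c, uintptr_t addr)
{
    uint32_t lo = 0, hi = c->len;

    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;

        if (addr < c->table[mid].start)
            hi = mid;
        else if (addr >= c->table[mid].end)
            lo = mid + 1;
        else
            return mid;
    }
    return UINT32_MAX;
}

/* Register n_pages consecutive pages starting at base, one entry per page.
 * Returns how many pages did not fit in the table. */
static uint32_t toy_register_pages(struct toy_cache *c, uintptr_t base,
                                   uint32_t n_pages)
{
    uint32_t dropped = 0;

    for (uint32_t i = 0; i < n_pages; i++) {
        uintptr_t start = base + (uintptr_t)i * TOY_PAGE_SIZE;

        if (toy_cache_insert(c, start, start + TOY_PAGE_SIZE) != 0)
            dropped++;
    }
    return dropped;
}
```

In this model, registering 600 pages leaves 344 of them outside the table, so a lookup for an address in the last page returns UINT32_MAX. In the driver, that miss is the point where the mlx5_mr_create() path described above is taken.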
