[dpdk-dev] [PATCH 3/5] net/mlx5: add new Memory Region support

Yongseok Koh yskoh at mellanox.com
Tue May 8 03:52:46 CEST 2018


On Sun, May 06, 2018 at 05:53:18AM -0700, Shahaf Shuler wrote:
> Hi Koh,
> 
> Huge work. It takes (and will take) me some time to process. 
> In the meantime, find some small comments. 
> 
> As this design heavily relies on synchronization (cache flush) between the
> control thread and the data-path threads, along with possible deadlocks from
> the memory hotplug events, the documentation is critical. Otherwise, future
> work will introduce serious bugs. 

Right. Even though I put lots of comments in the code, I'll try to write an
overview in the commit message.

> Thursday, May 3, 2018 2:17 AM, Yongseok Koh:
> > Subject: [dpdk-dev] [PATCH 3/5] net/mlx5: add new Memory Region support
> > 
> > This is the new design of Memory Region (MR) for mlx PMD, in order to:
> > - Accommodate the new memory hotplug model.
> > - Support non-contiguous Mempool.
> 
> This commit log is missing a lot of details about the design. It must be
> made clear to every Mellanox PMD developer. 
> 
> Just to make sure I understand all the details:
> We have
> 1. Cache (L0) per rxq/txq of size MLX5_MR_CACHE_N, searched starting from the MRU entry with a fallback to linear search.
> 2. btree (L1) per rxq/txq of dynamic size, searched with binary search. This is what you refer to as the bottom half, right? 

Right.
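
For future readers, the L0 fast path conceptually looks like the sketch below.
This is a simplified illustration, not the exact driver code; the helper name
is hypothetical, while the struct fields are the ones from mlx5_mr.h in this
patch:

/*
 * Sketch of the top-half (L0) lookup: check the most-recently-used entry
 * first, then scan the small fixed-size per-queue array linearly. On a
 * miss, the bottom half (B-tree search) takes over.
 */
static __rte_always_inline uint32_t
mr_ctrl_lookup_sketch(struct mlx5_mr_ctrl *mr_ctrl, uintptr_t addr)
{
	struct mlx5_mr_cache *cache = mr_ctrl->cache;
	uint16_t i = mr_ctrl->mru;

	/* 1. Most-recently-used entry first. */
	if (likely(addr >= cache[i].start && addr < cache[i].end))
		return cache[i].lkey;
	/* 2. Linear scan of the top-half cache. */
	for (i = 0; i < MLX5_MR_CACHE_N; ++i) {
		if (addr >= cache[i].start && addr < cache[i].end) {
			mr_ctrl->mru = i;
			return cache[i].lkey;
		}
	}
	/* 3. Miss; fall back to the bottom half (not shown). */
	return UINT32_MAX;
}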

> 3. global MR cache (L2) per device, of dynamic size (?) 

The size of the global btree table isn't dynamically increased. Making that
possible would require quite complex code in order to avoid deadlock: the
table can only be changed while holding the lock, and the rte_malloc() needed
to expand it may itself raise a memory hotplug event. Because overflow is a
very rare case, I decided not to implement expansion. Instead, I made the
global cache table twice as big as the local table.

> 4. list of all MRs (L3) per device. 

This is the original data. Even if the global cache overflows, the overall
process keeps working correctly, just slowly, because lookups then have to
access this list directly and search it linearly, traversing the entries one
by one.
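
Conceptually, that L3 fallback is just a linear walk. A minimal sketch,
assuming the list is named priv->mr.mr_list as in the full patch; the real
lookup also honors the per-MR memseg bitmap, which is omitted here:

static uint32_t
mr_list_lookup_sketch(struct priv *priv, uintptr_t addr)
{
	struct mlx5_mr *mr;

	/* O(n) traversal, entry by entry. Always correct but slow. */
	LIST_FOREACH(mr, &priv->mr.mr_list, mr) {
		uintptr_t start = (uintptr_t)mr->ibv_mr->addr;
		uintptr_t end = start + mr->ibv_mr->length;

		if (addr >= start && addr < end)
			return rte_cpu_to_be_32(mr->ibv_mr->lkey);
	}
	return UINT32_MAX; /* Not registered; a new MR must be created. */
}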

[...]
> > diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
> > index 01d554758..2883f20af 100644
> > --- a/drivers/net/mlx5/mlx5.c
> > +++ b/drivers/net/mlx5/mlx5.c
> > @@ -41,6 +41,7 @@
> >  #include "mlx5_autoconf.h"
> >  #include "mlx5_defs.h"
> >  #include "mlx5_glue.h"
> > +#include "mlx5_mr.h"
> > 
> >  /* Device parameter to enable RX completion queue compression. */
> >  #define MLX5_RXQ_CQE_COMP_EN "rxq_cqe_comp_en"
> > @@ -84,10 +85,49 @@
> >  #define MLX5DV_CONTEXT_FLAGS_CQE_128B_COMP (1 << 4)
> >  #endif
> > 
> > +static const char *MZ_MLX5_PMD_SHARED_DATA = "mlx5_pmd_shared_data";
> > +
> > +/* Shared memory between primary and secondary processes. */
> > +struct mlx5_shared_data *mlx5_shared_data;
> > +
> > +/* Spinlock for mlx5_shared_data allocation. */
> > +static rte_spinlock_t mlx5_shared_data_lock = RTE_SPINLOCK_INITIALIZER;
> > +
> >  /** Driver-specific log messages type. */
> >  int mlx5_logtype;
> > 
> >  /**
> > + * Prepare shared data between primary and secondary process.
> > + */
> > +static void
> > +mlx5_prepare_shared_data(void)
> > +{
> > +	const struct rte_memzone *mz;
> > +
> > +	rte_spinlock_lock(&mlx5_shared_data_lock);
> > +	if (mlx5_shared_data == NULL) {
> > +		if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
> > +			/* Allocate shared memory. */
> > +			mz = rte_memzone_reserve(MZ_MLX5_PMD_SHARED_DATA,
> > +						 sizeof(*mlx5_shared_data),
> > +						 SOCKET_ID_ANY, 0);
> > +		} else {
> > +			/* Lookup allocated shared memory. */
> > +			mz = rte_memzone_lookup(MZ_MLX5_PMD_SHARED_DATA);
> > +		}
> > +		if (mz == NULL)
> > +			rte_panic("Cannot allocate mlx5 shared data\n");
> > +		mlx5_shared_data = mz->addr;
> > +		/* Initialize shared data. */
> > +		if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
> > +			LIST_INIT(&mlx5_shared_data->mem_event_cb_list);
> > +			rte_rwlock_init(&mlx5_shared_data->mem_event_rwlock);
> > +		}
> > +	}
> > +	rte_spinlock_unlock(&mlx5_shared_data_lock);
> > +}
> > +
> 
> Can you elaborate why mlx5_shared_data can't be part of priv? Priv is already
> allocated on the shared memory, and the rte_eth_dev layer enforces the secondary
> process creation as part of the rte_eth_dev_data allocation. 

Good question. The priv is a per-device structure, but the callback mechanism
is not device-specific; it is global. The callback function has to iterate
over all the registered devices, so there has to be a place to keep the list
head and its lock, and that place must be DPDK-global.
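
In other words, the callback looks roughly like this. A sketch only: the
free-handler name mlx5_mr_mem_event_free_cb() is my assumption based on the
rest of the patch, while eth_dev() and the shared-data fields are quoted
above:

void
mlx5_mr_mem_event_cb(enum rte_mem_event event_type, const void *addr,
		     size_t len)
{
	struct priv *priv;

	switch (event_type) {
	case RTE_MEM_EVENT_FREE:
		rte_rwlock_read_lock(&mlx5_shared_data->mem_event_rwlock);
		/* Iterate all devices; a per-priv list head couldn't express this. */
		LIST_FOREACH(priv, &mlx5_shared_data->mem_event_cb_list,
			     mem_event_cb)
			mlx5_mr_mem_event_free_cb(eth_dev(priv), addr, len);
		rte_rwlock_read_unlock(&mlx5_shared_data->mem_event_rwlock);
		break;
	default:
		break;
	}
}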

> > +/**
> >   * Retrieve integer value from environment variable.
> >   *
> >   * @param[in] name
> > @@ -201,6 +241,7 @@ mlx5_dev_close(struct rte_eth_dev *dev)
> >  		priv->txqs = NULL;
> >  	}
> >  	mlx5_flow_delete_drop_queue(dev);
> > +	mlx5_mr_release(dev);
> >  	if (priv->pd != NULL) {
> >  		assert(priv->ctx != NULL);
> >  		claim_zero(mlx5_glue->dealloc_pd(priv->pd));
> > @@ -633,6 +674,8 @@ mlx5_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
> >  	struct ibv_counter_set_description cs_desc;
> >  #endif
> > 
> > +	/* Prepare shared data between primary and secondary process. */
> > +	mlx5_prepare_shared_data();
> >  	assert(pci_drv == &mlx5_driver);
> >  	/* Get mlx5_dev[] index. */
> >  	idx = mlx5_dev_idx(&pci_dev->addr);
> > @@ -1293,6 +1336,8 @@ rte_mlx5_pmd_init(void)
> >  	}
> >  	mlx5_glue->fork_init();
> >  	rte_pci_register(&mlx5_driver);
> > +	rte_mem_event_callback_register("MLX5_MEM_EVENT_CB",
> > +					mlx5_mr_mem_event_cb);
> 
> mlx5_mr_mem_event_cb requires PMD private structure. Is registering for the cb
> on the init makes sense? It looks like a better place is the PCI probe, after
> the eth_dev allocation. 

The callback function has to be registered once per PMD, but the probe
function is called once per device.

[...]
> > diff --git a/drivers/net/mlx5/mlx5_ethdev.c
> > b/drivers/net/mlx5/mlx5_ethdev.c
> > index 746b94f73..6bb43cf4e 100644
> > --- a/drivers/net/mlx5/mlx5_ethdev.c
> > +++ b/drivers/net/mlx5/mlx5_ethdev.c
> > @@ -34,6 +34,7 @@
> >  #include <rte_interrupts.h>
> >  #include <rte_malloc.h>
> >  #include <rte_string_fns.h>
> > +#include <rte_rwlock.h>
> > 
> >  #include "mlx5.h"
> >  #include "mlx5_glue.h"
> > @@ -413,6 +414,21 @@ mlx5_dev_configure(struct rte_eth_dev *dev)
> >  		if (++j == rxqs_n)
> >  			j = 0;
> >  	}
> > +	/*
> > +	 * Once the device is added to the list of memory event callback, its
> > +	 * global MR cache table cannot be expanded on the fly because of
> > +	 * deadlock. If it overflows, lookup should be done by searching MR list
> > +	 * linearly, which is slow.
> > +	 */
> > +	if (mlx5_mr_btree_init(&priv->mr.cache, MLX5_MR_BTREE_CACHE_N * 2,
> 
> Why multiply by 2? Because it holds all the rxq/txq MRs? 

Like I mentioned, it can't be dynamically expanded due to deadlock, so I
wanted to give the global table enough space up front. However, there's no
theory for calculating the best size; I just doubled it on gut feeling :-).
The size of a cache entry is 20B, so the size of the table is 256 * 2 * 20B =
10KB.

Even though it can't be expanded, everything still works by linearly accessing
the original MR list. It is slow, but this is not a frequently accessed code
path.
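
The arithmetic can even be checked at compile time. A sketch, assuming
MLX5_MR_BTREE_CACHE_N == 256 (per the numbers above) and 64-bit pointers,
which give 20B per packed entry:

#include <stdint.h>
#include <assert.h>
#include <rte_common.h>

#define MLX5_MR_BTREE_CACHE_N 256

struct mlx5_mr_cache {
	uintptr_t start; /* Start address of MR. */
	uintptr_t end;   /* End address of MR. */
	uint32_t lkey;   /* rte_cpu_to_be_32(ibv_mr->lkey). */
} __rte_packed;

static_assert(sizeof(struct mlx5_mr_cache) == 20, "entry is 20B");
/* Global table: 2 * 256 entries * 20B = 10240B = 10KB. */
static_assert(2 * MLX5_MR_BTREE_CACHE_N * sizeof(struct mlx5_mr_cache) ==
	      10240, "global cache table is 10KB");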

> > +			       dev->device->numa_node)) {
> > +		/* rte_errno is already set. */
> > +		return -rte_errno;
> > +	}
> > +	rte_rwlock_write_lock(&mlx5_shared_data->mem_event_rwlock);
> > +	LIST_INSERT_HEAD(&mlx5_shared_data->mem_event_cb_list,
> > +			 priv, mem_event_cb);
> > +	rte_rwlock_write_unlock(&mlx5_shared_data->mem_event_rwlock);
> >  	return 0;
> >  }
> 
> Why is the registration done only on configure and not on probe, after priv
> initialization? 

Once a device is added to this list, it starts being searched by the callback
function. It is better to defer those unnecessary calls until the device is
actually configured.

> > diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c
> > index 736c40ae4..e964912bb 100644
> > --- a/drivers/net/mlx5/mlx5_mr.c
> > +++ b/drivers/net/mlx5/mlx5_mr.c
> > @@ -13,8 +13,1202 @@
> > 
> >  #include <rte_mempool.h>
> >  #include <rte_malloc.h>
> > +#include <rte_rwlock.h>
> > 
> >  #include "mlx5.h"
> > +#include "mlx5_mr.h"
> >  #include "mlx5_rxtx.h"
> >  #include "mlx5_glue.h"
> > 
> > +struct mr_find_contig_memsegs_data {
> > +	uintptr_t addr;
> > +	uintptr_t start;
> > +	uintptr_t end;
> > +	const struct rte_memseg_list *msl;
> > +};
> > +
> > +struct mr_update_mp_data {
> > +	struct rte_eth_dev *dev;
> > +	struct mlx5_mr_ctrl *mr_ctrl;
> > +	int ret;
> > +};
> > +
> > +/**
> > + * Expand B-tree table to a given size. Can't be called while holding
> > + * memory_hotplug_lock or priv->mr.rwlock due to rte_realloc().
> > + *
> > + * @param bt
> > + *   Pointer to B-tree structure.
> > + * @param n
> > + *   Number of entries for expansion.
> > + *
> > + * @return
> > + *   0 on success, -1 on failure.
> > + */
> > +static int
> > +mr_btree_expand(struct mlx5_mr_btree *bt, int n)
> > +{
> > +	void *mem;
> > +	int ret = 0;
> > +
> > +	if (n <= bt->size)
> > +		return ret;
> > +	/*
> > +	 * Downside of directly using rte_realloc() is that SOCKET_ID_ANY is
> > +	 * used inside if there's no room to expand. Because this is quite a
> > +	 * rare case and part of a very slow path, it is acceptable.
> > +	 * Initially cache_bh[] will be given practically enough space and once
> > +	 * it is expanded, expansion wouldn't be needed again ever.
> > +	 */
> > +	mem = rte_realloc(bt->table, n * sizeof(struct mlx5_mr_cache), 0);
> > +	if (mem == NULL) {
> > +		/* Not an error, B-tree search will be skipped. */
> > +		DRV_LOG(WARNING, "failed to expand MR B-tree (%p) table",
> > +			(void *)bt);
> 
> 
> DRV_LOG should have the port id of the device. For all of the DRV_LOG
> instances in the patch. 

This function isn't device-specific. I don't want to take port_id as an
argument only for a debug message.

> Per my understanding, it falls back to the old btree in case the expansion
> fails, right? Btree searches will still happen. 

Right. If a datapath thread fails to expand its local cache, it will keep
missing on some entries, which will eventually be found in the device cache by
mr_lookup_dev() in mlx5_mr_create(). Inside mr_lookup_dev(), if even the global
cache has overflowed, the address has to be searched against the MR list.
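
To spell out that miss flow, here is a rough reconstruction of the bottom
half. Only mr_btree_lookup(), mlx5_mr_lookup_dev() and mlx5_mr_create() are
names from the patch; the refill details are simplified:

static uint32_t
addr2mr_bh_sketch(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
		  uintptr_t addr)
{
	struct mlx5_mr_cache entry;
	uint16_t idx;
	uint32_t lkey;

	/* L1: binary search on the per-queue bottom-half B-tree. */
	lkey = mr_btree_lookup(&mr_ctrl->cache_bh, &idx, addr);
	if (lkey != UINT32_MAX)
		return lkey;
	/*
	 * L2: global B-tree under priv->mr.rwlock. On a miss there,
	 * mlx5_mr_create() takes over and checks L3 (the MR list) before
	 * registering a brand-new MR.
	 */
	lkey = mlx5_mr_lookup_dev(dev, mr_ctrl, &entry, addr);
	if (lkey == UINT32_MAX)
		return UINT32_MAX; /* rte_errno is set by the callee. */
	/* Refill the top-half (L0) cache with the found entry. */
	mr_ctrl->cache[mr_ctrl->head] = entry;
	mr_ctrl->mru = mr_ctrl->head;
	mr_ctrl->head = (mr_ctrl->head + 1) % MLX5_MR_CACHE_N;
	return lkey;
}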

[...]
> > +/**
> > + * Insert an entry to B-tree lookup table.
> > + *
> > + * @param bt
> > + *   Pointer to B-tree structure.
> > + * @param entry
> > + *   Pointer to new entry to insert.
> > + *
> > + * @return
> > + *   0 on success, -1 on failure.
> > + */
> > +static int
> > +mr_btree_insert(struct mlx5_mr_btree *bt, struct mlx5_mr_cache *entry)
> > +{
> > +	struct mlx5_mr_cache *lkp_tbl;
> > +	uint16_t idx = 0;
> > +	size_t shift;
> > +
> > +	assert(bt != NULL);
> > +	assert(bt->len <= bt->size);
> > +	assert(bt->len > 0);
> > +	lkp_tbl = *bt->table;
> > +	/* Find out the slot for insertion. */
> > +	if (mr_btree_lookup(bt, &idx, entry->start) != UINT32_MAX) {
> > +		DRV_LOG(DEBUG,
> > +			"abort insertion to B-tree(%p):"
> > +			" already exist at idx=%u [0x%lx, 0x%lx) lkey=0x%x",
> > +			(void *)bt, idx, entry->start, entry->end, entry->lkey);
> > +		/* Already exist, return. */
> > +		return 0;
> > +	}
> > +	/* If table is full, return error. */
> > +	if (unlikely(bt->len == bt->size)) {
> > +		bt->overflow = 1;
> > +		return -1;
> > +	}
> > +	/* Insert entry. */
> > +	++idx;
> > +	shift = (bt->len - idx) * sizeof(struct mlx5_mr_cache);
> > +	if (shift)
> > +		memmove(&lkp_tbl[idx + 1], &lkp_tbl[idx], shift);
> > +	lkp_tbl[idx] = *entry;
> > +	bt->len++;
> > +	DRV_LOG(DEBUG,
> > +		"inserted B-tree(%p)[%u], [0x%lx, 0x%lx) lkey=0x%x",
> > +		(void *)bt, idx, entry->start, entry->end, entry->lkey);
> > +	return 0;
> > +}
> 
> Can you elaborate on how you make sure the btree is always sorted based on the start addr? 

mr_btree_lookup() above returns the appropriate index at which to insert the
new entry.
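
To make the invariant explicit: the table is kept sorted by start address,
and on a miss the binary search leaves *idx at the rightmost entry whose
start is <= addr, which is exactly the slot after which mr_btree_insert()
above shifts and inserts. A simplified sketch of the lookup (asserts and
edge cases trimmed; not the verbatim driver code):

static uint32_t
mr_btree_lookup(struct mlx5_mr_btree *bt, uint16_t *idx, uintptr_t addr)
{
	struct mlx5_mr_cache *lkp_tbl = *bt->table;
	uint16_t n = bt->len;
	uint16_t base = 0;

	/* Entry 0 is a NULL sentinel, so base always stays valid. */
	do {
		register uint16_t delta = n >> 1;

		if (addr < lkp_tbl[base + delta].start) {
			n = delta;
		} else {
			base += delta;
			n -= delta;
		}
	} while (n > 1);
	*idx = base;
	if (addr < lkp_tbl[base].end)
		return lkp_tbl[base].lkey; /* Hit. */
	return UINT32_MAX; /* Miss; *idx is the insertion anchor. */
}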

[...]
> > +/**
> > + * Free Memory Region (MR).
> > + *
> > + * @param dev
> > + *   Pointer to Ethernet device.
> > + * @param mr
> > + *   Pointer to MR to free.
> > + */
> > +void
> > +mlx5_mr_free(struct rte_eth_dev *dev, struct mlx5_mr *mr)
> > +{
> 
> Who calls this function? I didn't see any caller. 

Right. When I wrote mlx5_mr_garbage_collect(), I should've deleted it.
Will remove.

> > +	struct priv *priv = dev->data->dev_private;
> > +
> > +	/* Detach from the list and free resources later. */
> > +	rte_rwlock_write_lock(&priv->mr.rwlock);
> > +	LIST_REMOVE(mr, mr);
> > +	rte_rwlock_write_unlock(&priv->mr.rwlock);
> > +	/*
> > +	 * rte_free() inside can't be called with holding the lock. This could
> > +	 * cause deadlock when calling free callback.
> > +	 */
> > +	mr_free(mr);
> > +	DRV_LOG(DEBUG, "port %u MR(%p) freed", dev->data->port_id, (void *)mr);
> > +}
> > +
> > +/**
> > + * Release resources of a detached MR having no online entry.
> > + *
> > + * @param dev
> > + *   Pointer to Ethernet device.
> > + */
> > +static void
> > +mlx5_mr_garbage_collect(struct rte_eth_dev *dev)
> > +{
> > +	struct priv *priv = dev->data->dev_private;
> > +	struct mlx5_mr *mr_next;
> > +	struct mlx5_mr_list free_list = LIST_HEAD_INITIALIZER(free_list);
> > +
> > +	/* Must be called from the primary process. */
> > +	assert(rte_eal_process_type() == RTE_PROC_PRIMARY);
> 
> Perhaps it is better to have this check not under assert?

It is a static function called by mlx5_mr_create(), which prints an error
message when a secondary process attempts to create an MR. It looks okay to
leave it as is.

[...]
> > +/**
> > + * Create a new global Memory Region (MR) for a missing virtual address.
> > + * Register entire virtually contiguous memory chunk around the address.
> > + *
> > + * @param dev
> > + *   Pointer to Ethernet device.
> > + * @param[out] entry
> > + *   Pointer to returning MR cache entry, found in the global cache or newly
> > + *   created. If failed to create one, this will not be updated.
> > + * @param addr
> > + *   Target virtual address to register.
> > + *
> > + * @return
> > + *   Searched LKey on success, UINT32_MAX on failure and rte_errno is set.
> > + */
> > +static uint32_t
> > +mlx5_mr_create(struct rte_eth_dev *dev, struct mlx5_mr_cache *entry,
> > +	       uintptr_t addr)
> > +{
> > +	struct priv *priv = dev->data->dev_private;
> > +	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
> > +	const struct rte_memseg_list *msl;
> > +	const struct rte_memseg *ms;
> > +	struct mlx5_mr *mr = NULL;
> > +	size_t len;
> > +	uint32_t ms_n;
> > +	uint32_t bmp_size;
> > +	void *bmp_mem;
> > +	int ms_idx_shift = -1;
> > +	unsigned int n;
> > +	struct mr_find_contig_memsegs_data data = {
> > +		.addr = addr,
> > +	};
> > +	struct mr_find_contig_memsegs_data data_re;
> > +
> > +	DRV_LOG(DEBUG, "port %u creating a MR using address (%p)",
> > +		dev->data->port_id, (void *)addr);
> > +	if (rte_eal_process_type() != RTE_PROC_PRIMARY) {
> > +		DRV_LOG(WARNING,
> > +			"port %u using address (%p) of unregistered mempool"
> > +			" in secondary process, please create mempool"
> > +			" before rte_eth_dev_start()",
> > +			dev->data->port_id, (void *)addr);
> > +		rte_errno = EPERM;
> > +		goto err_nolock;
> > +	}
> > +	/*
> > +	 * Release detached MRs if any. This can't be called while holding
> > +	 * either memory_hotplug_lock or priv->mr.rwlock. MRs on the free
> > +	 * list have been detached by the memory free event but couldn't be
> > +	 * released inside the callback due to deadlock. As a result,
> > +	 * releasing resources is quite opportunistic.
> > +	 */
> > +	mlx5_mr_garbage_collect(dev);
> > +	/*
> > +	 * Find out a contiguous virtual address chunk in use, to which the
> > +	 * given address belongs, in order to register maximum range. In the
> > +	 * best case where mempools are not dynamically recreated and
> > +	 * '--socket-mem' is specified as an EAL option, it is very likely to
> > +	 * have only one MR(LKey) per socket and per hugepage-size even
> > +	 * though the system memory is highly fragmented.
> > +	 */
> > +	if (!rte_memseg_contig_walk(mr_find_contig_memsegs_cb, &data)) {
> > +		DRV_LOG(WARNING,
> > +			"port %u unable to find virtually contiguous"
> > +			" chunk for address (%p)."
> > +			" rte_memseg_contig_walk() failed.",
> > +			dev->data->port_id, (void *)addr);
> > +		rte_errno = ENXIO;
> > +		goto err_nolock;
> > +	}
> > +alloc_resources:
> > +	/* Addresses must be page-aligned. */
> > +	assert(rte_is_aligned((void *)data.start, data.msl->page_sz));
> > +	assert(rte_is_aligned((void *)data.end, data.msl->page_sz));
> 
> Better to have this check outside of assert. 

rte_memseg_contig_walk() guarantees it, but I'm just being paranoid since the
API is still experimental.
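
For reference, the walk callback used above is presumably along these lines
(mr_find_contig_memsegs_cb from the full patch, slightly trimmed; the data
struct is the one quoted at the top of mlx5_mr.c):

static int
mr_find_contig_memsegs_cb(const struct rte_memseg_list *msl,
			  const struct rte_memseg *ms, size_t len, void *arg)
{
	struct mr_find_contig_memsegs_data *data = arg;

	if (data->addr < ms->addr_64 || data->addr >= ms->addr_64 + len)
		return 0; /* Not the chunk we are looking for; keep walking. */
	/* Found the virtually contiguous chunk containing the address. */
	data->start = ms->addr_64;
	data->end = ms->addr_64 + len;
	data->msl = msl;
	return 1; /* Non-zero return stops the walk. */
}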

[...]
> > +/**
> > + * Look up address in the global MR cache table. If not found, create a
> > + * new MR.
> > + * Insert the found/created entry to local bottom-half cache table.
> > + *
> > + * @param dev
> > + *   Pointer to Ethernet device.
> > + * @param mr_ctrl
> > + *   Pointer to per-queue MR control structure.
> > + * @param[out] entry
> > + *   Pointer to returning MR cache entry, found in the global cache or newly
> > + *   created. If failed to create one, this is not written.
> > + * @param addr
> > + *   Search key.
> > + *
> > + * @return
> > + *   Searched LKey on success, UINT32_MAX on no match.
> > + */
> > +static uint32_t
> > +mlx5_mr_lookup_dev(struct rte_eth_dev *dev, struct mlx5_mr_ctrl *mr_ctrl,
> > +		   struct mlx5_mr_cache *entry, uintptr_t addr)
> > +{
> > +	struct priv *priv = dev->data->dev_private;
> > +	struct mlx5_mr_btree *bt = &mr_ctrl->cache_bh;
> > +	uint16_t idx;
> > +	uint32_t lkey;
> > +
> > +	/* If local cache table is full, try to double it. */
> > +	if (unlikely(bt->len == bt->size))
> > +		mr_btree_expand(bt, bt->size << 1);
> > +	/* Look up in the global cache. */
> > +	rte_rwlock_read_lock(&priv->mr.rwlock);
> > +	lkey = mr_btree_lookup(&priv->mr.cache, &idx, addr);
> > +	if (lkey != UINT32_MAX) {
> > +		/* Found. */
> > +		*entry = (*priv->mr.cache.table)[idx];
> > +		rte_rwlock_read_unlock(&priv->mr.rwlock);
> > +		/*
> > +		 * Update local cache. Even if it fails, return the found entry
> > +		 * to update top-half cache. Next time, this entry will be
> > +		 * found in the global cache.
> > +		 */
> > +		mr_btree_insert(bt, entry);
> > +		return lkey;
> > +	}
> > +	rte_rwlock_read_unlock(&priv->mr.rwlock);
> > +	/* First time to see the address? Create a new MR. */
> > +	lkey = mlx5_mr_create(dev, entry, addr);
> 
> Shouldn't we check whether the addr is in the global MR list, for the case where the global cache overflows? 

That is checked inside mlx5_mr_create() by mr_lookup_dev().

[...]
> > +/**
> > + * Bottom-half of LKey search on Rx.
> > + *
> > + * @param rxq
> > + *   Pointer to Rx queue structure.
> > + * @param addr
> > + *   Search key.
> > + *
> > + * @return
> > + *   Searched LKey on success, UINT32_MAX on no match.
> > + */
> > +uint32_t
> > +mlx5_rx_addr2mr_bh(struct mlx5_rxq_data *rxq, uintptr_t addr)
> > +{
> > +	struct mlx5_rxq_ctrl *rxq_ctrl =
> > +		container_of(rxq, struct mlx5_rxq_ctrl, rxq);
> > +	struct mlx5_mr_ctrl *mr_ctrl = &rxq->mr_ctrl;
> > +	struct priv *priv = rxq_ctrl->priv;
> > +
> > +	DRV_LOG(DEBUG,
> > +		"Rx queue %u: miss on top-half, mru=%u, head=%u, addr=%p",
> > +		rxq_ctrl->idx, mr_ctrl->mru, mr_ctrl->head, (void *)addr);
> > +	return mlx5_mr_addr2mr_bh(eth_dev(priv), mr_ctrl, addr);
> > +}
> 
> Shouldn't this code path be in mlx5_rxq? 

It is an interface between rxq and mr. I prefer to keep it here for better
maintainability.

[...]
> > diff --git a/drivers/net/mlx5/mlx5_mr.h b/drivers/net/mlx5/mlx5_mr.h
> > new file mode 100644
> > index 000000000..a0a0ef755
> > --- /dev/null
> > +++ b/drivers/net/mlx5/mlx5_mr.h
> > @@ -0,0 +1,121 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright 2018 6WIND S.A.
> > + * Copyright 2018 Mellanox Technologies, Ltd
> > + */
> > +
> > +#ifndef RTE_PMD_MLX5_MR_H_
> > +#define RTE_PMD_MLX5_MR_H_
> > +
> > +#include <stddef.h>
> > +#include <stdint.h>
> > +#include <sys/queue.h>
> > +
> > +/* Verbs header. */
> > +/* ISO C doesn't support unnamed structs/unions, disabling -pedantic. */
> > +#ifdef PEDANTIC
> > +#pragma GCC diagnostic ignored "-Wpedantic"
> > +#endif
> > +#include <infiniband/verbs.h>
> > +#include <infiniband/mlx5dv.h>
> > +#ifdef PEDANTIC
> > +#pragma GCC diagnostic error "-Wpedantic"
> > +#endif
> > +
> > +#include <rte_eal_memconfig.h>
> > +#include <rte_ethdev.h>
> > +#include <rte_rwlock.h>
> > +#include <rte_bitmap.h>
> > +
> > +/* Memory Region object. */
> > +struct mlx5_mr {
> > +	LIST_ENTRY(mlx5_mr) mr; /**< Pointer to the prev/next entry. */
> > +	struct ibv_mr *ibv_mr; /* Verbs Memory Region. */
> > +	const struct rte_memseg_list *msl;
> > +	int ms_base_idx; /* Start index of msl->memseg_arr[]. */
> > +	int ms_n; /* Number of memsegs in use. */
> > +	uint32_t ms_bmp_n; /* Number of bits in memsegs bit-mask. */
> > +	struct rte_bitmap *ms_bmp; /* Bit-mask of memsegs belonging to MR. */
> > +};
> > +
> > +/* Cache entry for Memory Region. */
> > +struct mlx5_mr_cache {
> > +	uintptr_t start; /* Start address of MR. */
> > +	uintptr_t end; /* End address of MR. */
> > +	uint32_t lkey; /* rte_cpu_to_be_32(ibv_mr->lkey). */
> > +} __rte_packed;
> > +
> > +/* MR Cache table for Binary search. */
> > +struct mlx5_mr_btree {
> > +	uint16_t len; /* Number of entries. */
> > +	uint16_t size; /* Total number of entries. */
> > +	int overflow; /* Mark failure of table expansion. */
> > +	struct mlx5_mr_cache (*table)[];
> > +} __rte_packed;
> > +
> > +/* Per-queue MR control descriptor. */
> > +struct mlx5_mr_ctrl {
> > +	uint32_t *dev_gen_ptr; /* Generation number of device to poll. */
> > +	uint32_t cur_gen; /* Generation number saved to flush caches. */
> > +	uint16_t mru; /* Index of last hit entry in top-half cache. */
> > +	uint16_t head; /* Index of the oldest entry in top-half cache. */
> > +	struct mlx5_mr_cache cache[MLX5_MR_CACHE_N]; /* Cache for top-half. */
> > +	struct mlx5_mr_btree cache_bh; /* Cache for bottom-half. */
> > +} __rte_packed;
> > +
> > +/* First entry must be NULL for comparison. */
> > +#define MR_N(n) ((n) - 1)
> > +
> > +/* Whether there's only one entry in MR lookup table. */
> > +#define IS_SINGLE_MR(n) (MR_N(n) == 1)
> 
> MLX5_IS_SINGLE_MR

Replaced it with mlx5_mr_btree_len().
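
Presumably something like the below, given that entry 0 is a NULL sentinel
("First entry must be NULL for comparison" above). Just a guess at the
helper, for the record:

static inline int
mlx5_mr_btree_len(struct mlx5_mr_btree *bt)
{
	/* Exclude the sentinel entry at index 0. */
	return bt->len - 1;
}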


Thanks,
Yongseok

