[dpdk-dev] [PATCH v1 0/6] mempool: add bucket driver

Olivier Matz olivier.matz at 6wind.com
Thu Apr 19 18:41:10 CEST 2018


Hi Andrew,

Sorry for the late feedback, a few comments below.

On Mon, Mar 26, 2018 at 05:12:53PM +0100, Andrew Rybchenko wrote:
> The initial patch series [1] (RFC v1 is [2]) is split into two parts to
> simplify processing.  This is the second part, which relies on the
> first one [3].
> 
> It should be applied on top of [4] and [3].
> 
> The patch series adds a bucket mempool driver which allocates (both
> physically and virtually) contiguous blocks of objects, and adds a
> mempool API to dequeue such blocks. The driver is still capable of
> providing separate objects, but it is definitely more heavy-weight
> than the ring/stack drivers. The driver will be used by future
> Solarflare driver enhancements which make use of physically
> contiguous blocks in the NIC firmware.
> 
> The target use case is to dequeue objects in blocks and to enqueue
> separate objects back (they are then collected into buckets to be
> dequeued again). So the memory pool with the bucket driver is created
> by an application and provided to the networking PMD receive queue.
> The choice of the bucket driver is made using
> rte_eth_dev_pool_ops_supported(). A PMD that relies upon contiguous
> block allocation should report the bucket driver as the only
> supported and preferred one.
> 
> The benefit of introducing the contiguous block dequeue operation is
> demonstrated by performance measurements using the mempool autotest
> with minor enhancements:
>  - in the original test the bulk sizes are powers of two, which is
>    unacceptable for us, so they are changed to multiples of
>    contig_block_size;
>  - the test code is duplicated to support both plain dequeue and
>    dequeue_contig_blocks;
>  - all the extra test variations (with/without cache etc.) are
>    eliminated;
>  - a fake read from the dequeued buffer is added (in both cases) to
>    simulate mbuf access.
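
[For reference, my understanding is that the contiguous-block side of
that duplicated loop looks roughly like the sketch below; the names
rte_mempool_get_contig_blocks(), rte_mempool_ops_get_info() and
contig_block_size are as I read them in the series, while MAX_BLOCKS
and n_get_bulk stand in for the test's parameters. Untested.]

  struct rte_mempool_info info;
  void *blocks[MAX_BLOCKS];
  size_t obj_stride;
  unsigned int i, j, fake = 0;

  if (rte_mempool_ops_get_info(mp, &info) < 0)
          return -1;
  /* consecutive objects of a block are total_elt_size bytes apart */
  obj_stride = mp->header_size + mp->elt_size + mp->trailer_size;

  if (rte_mempool_get_contig_blocks(mp, blocks,
                                    n_get_bulk / info.contig_block_size) == 0) {
          for (i = 0; i < n_get_bulk / info.contig_block_size; i++) {
                  char *obj = blocks[i];

                  for (j = 0; j < info.contig_block_size;
                       j++, obj += obj_stride)
                          fake += *(volatile char *)obj; /* simulate mbuf access */
          }
          /* objects are then put back one by one, as in the plain test */
  }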
> 
> start performance test for bucket (without cache)
> mempool_autotest cache=   0 cores= 1 n_get_bulk=  15 n_put_bulk=   1 n_keep=  30 Srate_persec=   111935488
> mempool_autotest cache=   0 cores= 1 n_get_bulk=  15 n_put_bulk=   1 n_keep=  60 Srate_persec=   115290931
> mempool_autotest cache=   0 cores= 1 n_get_bulk=  15 n_put_bulk=  15 n_keep=  30 Srate_persec=   353055539
> mempool_autotest cache=   0 cores= 1 n_get_bulk=  15 n_put_bulk=  15 n_keep=  60 Srate_persec=   353330790
> mempool_autotest cache=   0 cores= 2 n_get_bulk=  15 n_put_bulk=   1 n_keep=  30 Srate_persec=   224657407
> mempool_autotest cache=   0 cores= 2 n_get_bulk=  15 n_put_bulk=   1 n_keep=  60 Srate_persec=   230411468
> mempool_autotest cache=   0 cores= 2 n_get_bulk=  15 n_put_bulk=  15 n_keep=  30 Srate_persec=   706700902
> mempool_autotest cache=   0 cores= 2 n_get_bulk=  15 n_put_bulk=  15 n_keep=  60 Srate_persec=   703673139
> mempool_autotest cache=   0 cores= 4 n_get_bulk=  15 n_put_bulk=   1 n_keep=  30 Srate_persec=   425236887
> mempool_autotest cache=   0 cores= 4 n_get_bulk=  15 n_put_bulk=   1 n_keep=  60 Srate_persec=   437295512
> mempool_autotest cache=   0 cores= 4 n_get_bulk=  15 n_put_bulk=  15 n_keep=  30 Srate_persec=  1343409356
> mempool_autotest cache=   0 cores= 4 n_get_bulk=  15 n_put_bulk=  15 n_keep=  60 Srate_persec=  1336567397
> start performance test for bucket (without cache + contiguous dequeue)
> mempool_autotest cache=   0 cores= 1 n_get_bulk=  15 n_put_bulk=   1 n_keep=  30 Crate_persec=   122945536
> mempool_autotest cache=   0 cores= 1 n_get_bulk=  15 n_put_bulk=   1 n_keep=  60 Crate_persec=   126458265
> mempool_autotest cache=   0 cores= 1 n_get_bulk=  15 n_put_bulk=  15 n_keep=  30 Crate_persec=   374262988
> mempool_autotest cache=   0 cores= 1 n_get_bulk=  15 n_put_bulk=  15 n_keep=  60 Crate_persec=   377316966
> mempool_autotest cache=   0 cores= 2 n_get_bulk=  15 n_put_bulk=   1 n_keep=  30 Crate_persec=   244842496
> mempool_autotest cache=   0 cores= 2 n_get_bulk=  15 n_put_bulk=   1 n_keep=  60 Crate_persec=   251618917
> mempool_autotest cache=   0 cores= 2 n_get_bulk=  15 n_put_bulk=  15 n_keep=  30 Crate_persec=   751226060
> mempool_autotest cache=   0 cores= 2 n_get_bulk=  15 n_put_bulk=  15 n_keep=  60 Crate_persec=   756233010
> mempool_autotest cache=   0 cores= 4 n_get_bulk=  15 n_put_bulk=   1 n_keep=  30 Crate_persec=   462068120
> mempool_autotest cache=   0 cores= 4 n_get_bulk=  15 n_put_bulk=   1 n_keep=  60 Crate_persec=   476997221
> mempool_autotest cache=   0 cores= 4 n_get_bulk=  15 n_put_bulk=  15 n_keep=  30 Crate_persec=  1432171313
> mempool_autotest cache=   0 cores= 4 n_get_bulk=  15 n_put_bulk=  15 n_keep=  60 Crate_persec=  1438829771
> 
> The number of objects in a contiguous block is a function of the
> bucket memory size (a .config option) and the total element size. In
> the future, an additional API that allows passing parameters at
> mempool allocation time may be added.
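
[Just to make that concrete, with purely illustrative numbers: taking a
32 KiB bucket and the autotest's 2048-byte elements plus a 64-byte
per-object mempool header, one gets roughly

  32768 / (2048 + 64) = 32768 / 2112 ~= 15 objects per block

which would be consistent with the n_get_bulk=15 used in the figures
above. The real block size of course depends on the configured bucket
size, the bucket header overhead and the actual total element size.]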
> 
> It breaks the ABI since it changes rte_mempool_ops. The ABI version
> is already bumped in [4].
> 
> 
> [1] https://dpdk.org/ml/archives/dev/2018-January/088698.html
> [2] https://dpdk.org/ml/archives/dev/2017-November/082335.html
> [3] https://dpdk.org/ml/archives/dev/2018-March/093807.html
> [4] https://dpdk.org/ml/archives/dev/2018-March/093196.html


As discussed privately, at first glance I was a bit reluctant to
introduce a new API in mempool that would only be available in one
mempool driver.

The same debate has come up for several ethdev features: should we
provide a generic API for a feature that is available in only one
driver, or should the driver provide its own API?

Given that the mempool driver API is not that big currently, and that
this can bring a performance enhancement (which is the primary goal of
DPDK), I think we can give this patchset a chance. Drivers that want to
use the new bucket driver can take advantage of the new API, while
keeping a fallback mode so they still work with other mempool drivers.
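
Something like the sketch below is what I have in mind for that
fallback in a PMD RX refill path; the only assumption is the
get_contig_blocks API from this series (with the block size queried
once at queue setup and set to 0 when unsupported), everything else is
the existing bulk API. Untested.

  /* Refill: prefer whole contiguous blocks, fall back to a plain bulk
   * get when the pool's ops do not provide blocks or none is available.
   */
  static int
  refill_objs(struct rte_mempool *mp, void **objs, unsigned int n,
              unsigned int contig_block_size)
  {
          if (contig_block_size != 0 &&
              rte_mempool_get_contig_blocks(mp, objs,
                                            n / contig_block_size) == 0)
                  /* objs[] now holds n / contig_block_size first-object
                   * pointers; the PMD derives the remaining addresses
                   * from the known object stride.
                   */
                  return 0;

          return rte_mempool_get_bulk(mp, objs, n);
  }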

The bet is:

- drivers and applications try the bucket driver and its new API, and
  they notice a performance increase
- the get_contig_block API gets implemented in some other drivers,
  where possible (not easy for the default one at least)
- the bucket driver could become the default driver, if the performance
  increase turns out to be significant and widespread.

By integrating this patchset, I hope we can also get some feedback
about the performance of this driver in other situations. My worries
are about pipeline + multi-core use cases, where it may add some
pressure (race conditions?) on the bucket header.

Finally, I think (as discussed privately) that the tests should be
updated so that the results in this cover letter can be reproduced. I
just did a quick test by replacing "stack" with "bucket" in the
autotest (see the patch below) and it fails in populate(). I did not
investigate further; maybe the parameters are not correct for bucket.

 --- a/test/test/test_mempool.c
 +++ b/test/test/test_mempool.c
 @@ -498,7 +498,7 @@ test_mempool(void)
                 printf("cannot allocate mp_stack mempool\n");
                 goto err;
         }
 -       if (rte_mempool_set_ops_byname(mp_stack, "stack", NULL) < 0) {
 +       if (rte_mempool_set_ops_byname(mp_stack, "bucket", NULL) < 0) {
                 printf("cannot set stack handler\n");
                 goto err;
         }
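
[Longer term, rather than patching the ops name by hand, the autotest
could take the ops name as a parameter, along the lines of the sketch
below (sketch only, reusing the MEMPOOL_SIZE/MEMPOOL_ELT_SIZE
definitions from test_mempool.c; the populate() failure for bucket
would of course still need to be understood):]

  static int
  test_mempool_ops(const char *ops_name)
  {
          struct rte_mempool *mp;
          int ret = -1;

          mp = rte_mempool_create_empty("test_ops", MEMPOOL_SIZE,
                                        MEMPOOL_ELT_SIZE, 0, 0,
                                        SOCKET_ID_ANY, 0);
          if (mp == NULL)
                  return -1;
          if (rte_mempool_set_ops_byname(mp, ops_name, NULL) < 0) {
                  printf("cannot set %s handler\n", ops_name);
                  goto out;
          }
          if (rte_mempool_populate_default(mp) < 0) {
                  printf("cannot populate %s mempool\n", ops_name);
                  goto out;
          }
          /* reuse the existing object init and basic checks on mp here */
          ret = 0;
  out:
          rte_mempool_free(mp);
          return ret;
  }

  /* ... then call test_mempool_ops("stack") and test_mempool_ops("bucket") */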



Thank you for this work, I hope we'll be on time for rc1.
Olivier

