[dpdk-dev] [PATCH v2] mempool: replace c memcpy code semantics with optimized rte_memcpy

Olivier MATZ olivier.matz at 6wind.com
Thu Jun 2 23:16:16 CEST 2016


Hi Jerin,

On 06/02/2016 11:39 AM, Jerin Jacob wrote:
> On Thu, Jun 02, 2016 at 09:36:34AM +0200, Olivier MATZ wrote:
>> I think the LIFO behavior should occur on a per-bulk basis. I mean,
>> it should behave like in the examples below:
>>
>>   // pool cache is in state X
>>   obj1 = mempool_get(mp)
>>   obj2 = mempool_get(mp)
>>   mempool_put(mp, obj2)
>>   mempool_put(mp, obj1)
>>   // pool cache is back in state X
>>
>>   // pool cache is in state X
>>   bulk1 = mempool_get_bulk(mp, 16)
>>   bulk2 = mempool_get_bulk(mp, 16)
>>   mempool_put_bulk(mp, bulk2, 16)
>>   mempool_put_bulk(mp, bulk1, 16)
>>   // pool cache is back in state X
>>
> 
> Per-entry LIFO behavior makes more sense in the _bulk_ case, as a recently
> enqueued buffer comes back out on the next "get", which gives a better
> chance that the buffer is still in the last-level cache.

Yes, from a memory cache perspective, I think you are right.

In practice, I'm not sure it's so important because with many hardware
drivers, a packet buffer returns to the pool only after a full round of
the TX ring. So I'd say it won't make a big difference here.
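
To make the comparison concrete, here is a rough sketch of the copy
variants we are talking about. This is an illustration only, not the
actual librte_mempool code; cache_objs, obj_table and n are placeholder
names:

#include <rte_memcpy.h>

/* Legacy get(): walk the cache backwards so that the most recently
 * put object is handed out first (per-entry LIFO). */
static inline void
cache_get_legacy(void **obj_table, void * const *cache_objs, unsigned n)
{
        unsigned i;

        for (i = 0; i < n; i++)
                obj_table[i] = cache_objs[n - 1 - i];
}

/* "No reverse" variant: copy the block as-is with rte_memcpy. Cheaper
 * in cycles, but gives up the per-entry LIFO ordering discussed above. */
static inline void
cache_get_flat(void **obj_table, void * const *cache_objs, unsigned n)
{
        rte_memcpy(obj_table, cache_objs, n * sizeof(void *));
}

/* put() side: the patch replaces a plain pointer copy loop... */
static inline void
cache_put_legacy(void **cache_objs, void * const *obj_table, unsigned n)
{
        unsigned i;

        for (i = 0; i < n; i++)
                cache_objs[i] = obj_table[i];
}

/* ...with an rte_memcpy of the whole block of pointers. */
static inline void
cache_put_memcpy(void **cache_objs, void * const *obj_table, unsigned n)
{
        rte_memcpy(cache_objs, obj_table, n * sizeof(void *));
}

The put() side is what your patch changes; the get() side is where the
reversal question comes in.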

>> Note that today it's not the case for bulks, since object addresses
>> are reversed only in get(), we are not back in the original state.
>> I don't really see the advantage of this.
>>
>> Removing the reversal may speed up the cache in the bulk get case,
>> I think.
> 
> I tried it in my setup and it was dropping the performance. Did you get
> an improvement in any setup?

I know that the mempool_perf autotest is not representative
of a real use case, but it gives a trend. I did a quick test with:
- the legacy code,
- rte_memcpy in put(),
- rte_memcpy in both put() and get() (no reverse).
It seems that removing the reversal brings ~50% improvement
with bulks of 32 (on a Westmere):

legacy
mempool_autotest cache=512 cores=1 n_get_bulk=32 n_put_bulk=32 n_keep=32 rate_persec=839922483
mempool_autotest cache=512 cores=1 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=849792204
mempool_autotest cache=512 cores=2 n_get_bulk=32 n_put_bulk=32 n_keep=32 rate_persec=1617022156
mempool_autotest cache=512 cores=2 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=1675087052
mempool_autotest cache=512 cores=4 n_get_bulk=32 n_put_bulk=32 n_keep=32 rate_persec=3202914713
mempool_autotest cache=512 cores=4 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=3268725963

rte_memcpy in put() (your patch proposal)
mempool_autotest cache=512 cores=1 n_get_bulk=32 n_put_bulk=32 n_keep=32 rate_persec=762157465
mempool_autotest cache=512 cores=1 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=744593817
mempool_autotest cache=512 cores=2 n_get_bulk=32 n_put_bulk=32 n_keep=32 rate_persec=1500276326
mempool_autotest cache=512 cores=2 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=1461347942
mempool_autotest cache=512 cores=4 n_get_bulk=32 n_put_bulk=32 n_keep=32 rate_persec=2974076107
mempool_autotest cache=512 cores=4 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=2928122264

rte_memcpy in put() and get()
mempool_autotest cache=512 cores=1 n_get_bulk=32 n_put_bulk=32 n_keep=32 rate_persec=974834892
mempool_autotest cache=512 cores=1 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=1129329459
mempool_autotest cache=512 cores=2 n_get_bulk=32 n_put_bulk=32 n_keep=32 rate_persec=2147798220
mempool_autotest cache=512 cores=2 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=2232457625
mempool_autotest cache=512 cores=4 n_get_bulk=32 n_put_bulk=32 n_keep=32 rate_persec=4510816664
mempool_autotest cache=512 cores=4 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=4582421298

This is probably more a measure of the pure CPU cost of the mempool
functions, without considering the memory cache aspect. So, of course,
a test on a real use case should be done to confirm whether it really
improves performance. I'll try to do such a test and let you know the result.
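
For reference, the measured loop is roughly equivalent to the following
(simplified; the real autotest keeps n_keep objects outstanding and
varies the bulk sizes):

#include <rte_mempool.h>

/* Simplified view of one benchmark iteration: with cache=512, both
 * calls are served from the per-lcore cache, so this mostly measures
 * the pointer-copy cost in get()/put(). */
static void
bench_iteration(struct rte_mempool *mp)
{
        void *objs[32];

        if (rte_mempool_get_bulk(mp, objs, 32) == 0)
                rte_mempool_put_bulk(mp, objs, 32);
}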

By the way, not all drivers allocate or free their mbufs in bulk, so
this modification would only benefit the ones that do. Which driver
are you using for your test?
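
To illustrate (not taken from any particular driver), a TX completion
path can either free the mbufs one by one or give the whole batch back
to the pool; only the latter goes through the bulk path touched here.
The helper names and the fixed batch size are made up for the example:

#include <rte_mbuf.h>
#include <rte_mempool.h>

#define BATCH_MAX 64 /* arbitrary for this sketch */

/* One object at a time: each mbuf goes through the cache individually. */
static void
tx_free_one_by_one(struct rte_mbuf **pkts, unsigned nb_done)
{
        unsigned i;

        for (i = 0; i < nb_done; i++)
                rte_pktmbuf_free(pkts[i]);
}

/* Bulk: return the whole batch in a single call, which is what
 * benefits from the rte_memcpy-based cache copy. Assumes nb_done <=
 * BATCH_MAX and that all mbufs are direct, non-segmented, come from
 * the same pool and have a reference count of 1. */
static void
tx_free_bulk(struct rte_mbuf **pkts, unsigned nb_done)
{
        void *objs[BATCH_MAX];
        unsigned i;

        for (i = 0; i < nb_done; i++)
                objs[i] = pkts[i];
        rte_mempool_put_bulk(pkts[0]->pool, objs, nb_done);
}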


Regards,
Olivier



