[PATCH 1/1] mempool: implement index-based per core cache
Dharmik Thakkar
Dharmik.Thakkar at arm.com
Thu Mar 24 20:51:43 CET 2022
Hi,
Thank you for the comments!
Based on the suggestions, I tested the patch for single-core L3Fwd performance with an increased number of routes/flows (up to 8K) to increase the cache footprint.
However, I don’t see much improvement with the patch.
> On Jan 21, 2022, at 5:25 AM, Ananyev, Konstantin <konstantin.ananyev at intel.com> wrote:
>
>
>
> Hi Dharmik,
>>>
>>>>>
>>>>>> Current mempool per core cache implementation stores pointers to mbufs
>>>>>> On 64b architectures, each pointer consumes 8B
>>>>>> This patch replaces it with index-based implementation,
>>>>>> wherein each buffer is addressed by (pool base address + index)
>>>>>> It reduces the amount of memory/cache required for per core cache
>>>>>>
>>>>>> L3Fwd performance testing reveals minor improvements in the cache
>>>>>> performance (L1 and L2 misses reduced by 0.60%)
>>>>>> with no change in throughput
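
For illustration, the addressing scheme described above amounts to roughly the following (a minimal sketch; the struct and helper names are illustrative, not taken from the patch):

/*
 * The per-core cache stores 4B offsets from the pool's base address
 * instead of 8B pointers; an object is recovered as base + index.
 * Assumes all objects of the pool live within the same 4GB region
 * above 'base'.
 */
#include <stdint.h>

struct idx_cache {
	uint32_t len;
	uint32_t objs[512];	/* 4B indexes instead of 8B pointers */
};

static inline void *
idx_to_obj(uintptr_t base, uint32_t idx)
{
	return (void *)(base + idx);
}

static inline uint32_t
obj_to_idx(uintptr_t base, void *obj)
{
	return (uint32_t)((uintptr_t)obj - base);
}
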
>>>>>
>>>>> I feel really sceptical about that patch and the whole idea in general:
>>>>> - From what I read above there is no real performance improvement observed.
>>>>> (In fact on my IA boxes mempool_perf_autotest reports ~20% slowdown,
>>>>> see below for more details).
>>>>
>>>> Currently, the optimizations (loop unroll and vectorization) are only implemented for ARM64.
>>>> Similar optimizations can be implemented for x86 platforms, which should close the performance gap
>>>> and, in my understanding, give better performance for a bulk size of 32.
>>>
>>> Might be, but I still don't see the reason for such effort.
>>> As you mentioned, there is no performance improvement in 'real' apps (l3fwd, etc.)
>>> on ARM64, even with the vectorized version of the code.
>>>
>>
>> IMO, even without a performance improvement, the patch is advantageous because it achieves
>> the same performance with less memory and cache utilization.
>>
>>>>> - Space utilization difference looks negligible too.
>>>>
>>>> Sorry, I did not understand this point.
>>>
>>> As I understand one of the expectations from that patch was:
>>> reduce memory/cache required, which should improve cache utilization
>>> (fewer misses, etc.).
>>> Though I think such improvements would be negligible and wouldn't
>>> cause any real performance gain.
>>
>> The cache utilization performance numbers are for the l3fwd app, which might not be bottlenecked at the mempool per core cache.
>> Theoretically, since a 4B index replaces an 8B pointer, this patch enables storing twice the number of objects in the same cache space as the original implementation.
>
> It saves just 4 bytes per mbuf.
> Even for simple l2fwd-like workload we access ~100 bytes per mbuf.
> Let's do a simplistic estimation of the number of affected cache-lines for l2fwd.
> For bulk of 32 packets, assuming 64B per cache-line and 16B per HW desc:
>
> number of cache-lines accessed
> cache with pointers / cache with indexes
> mempool_get: (32*8)/64=4 / (32*4)/64=2
> RX (read HW desc): (32*16)/64=8 / (32*16)/64=8
> RX (write mbuf fields, 1st cache line): (32*64)/64=32 / (32*64)/64=32
> update mac addrs: (32*64)/64=32 / (32*64)/64=32
> TX (write HW desc): (32*16)/64=8 / (32*16)/64=8
> free mbufs (read 2nd mbuf cache line): (32*64)/64=32 / (32*64)/64=32
> mempool_put: (32*8)/64=4 / (32*4)/64=2
> total: 120 / 116
>
> So, if my calculations are correct, max estimated gain for cache utilization would be:
> (120-116)*100/120=3.33%
> Note that the numbers are for an over-simplistic usage scenario.
> In more realistic ones, when we have to touch more cache-lines per packet,
> that difference would be even less noticeable.
> So I really doubt we will see any noticeable improvements in terms of cache utilization
> with that patch.
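
For reference, the estimate above can be reproduced with a few lines of stand-alone arithmetic (not DPDK code; the step list simply mirrors the table):

#include <stdio.h>

#define BURST		32
#define CACHE_LINE	64

/* cache-lines touched when BURST objects of 'sz' bytes are accessed */
static unsigned int
lines(unsigned int sz)
{
	return (BURST * sz) / CACHE_LINE;
}

int
main(void)
{
	/* steps that are identical with pointers and with indexes */
	unsigned int common = lines(16)		/* RX: read HW desc */
			+ lines(64)		/* RX: write 1st mbuf cache line */
			+ lines(64)		/* update mac addrs */
			+ lines(16)		/* TX: write HW desc */
			+ lines(64);		/* free: read 2nd mbuf cache line */

	/* only mempool_get/put differ: 8B pointers vs 4B indexes */
	unsigned int with_ptr = common + 2 * lines(8);
	unsigned int with_idx = common + 2 * lines(4);

	printf("pointers: %u, indexes: %u, max gain: %.2f%%\n",
			with_ptr, with_idx,
			(with_ptr - with_idx) * 100.0 / with_ptr);
	return 0;
}

This prints 120, 116 and 3.33%, matching the table.
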
>
>>>
>>>>> - The change introduces a new build time config option with a major limitation:
>>>>> All memzones in a pool have to be within the same 4GB boundary.
>>>>> To address it properly, extra changes will be required in the init/populate part of the code.
>>>>
>>>> I agree with the above-mentioned challenges and am currently working on resolving these issues.
>>>
>>> I still think that to justify such changes some really noticeable performance
>>> improvement needs to be demonstrated: double-digit speedup for l3fwd/ipsec-secgw/...
>>> Otherwise it is just not worth the hassle.
>>>
>>
>> Like I mentioned earlier, the app might not be bottlenecked at the mempool per core cache.
>> That could be the reason the numbers with l3fwd don’t fully show the advantage of the patch.
>
> As I said above, I don’t think we'll see any real advantage here.
> But feel free to pick up a different app and prove me wrong.
> After all, we have plenty of sample apps that do put enough
> pressure on the cache: l3fwd-acl, ipsec-secgw.
> Or you can even apply these patches from Sean:
> https://patches.dpdk.org/project/dpdk/list/?series=20999
> to run l3fwd with configurable routes.
> That should help you to make it cache-bound.
>
Thank you, Konstantin! Those patches were helpful.
>> I’m seeing a double-digit improvement with mempool_perf_autotest, which should not be ignored.
>
> And for others we are seeing a double-digit degradation.
> So far the whole idea doesn't look promising at all, at least to me.
> Konstantin
>