[dpdk-dev] LLC miss in librte_distributor

jigsaw jigsaw at gmail.com
Wed Nov 12 09:37:33 CET 2014


Hi,

OK, it is now very clear that the problem is due to memory transactions between
different NUMA nodes.

The test program is here:
https://gist.github.com/jigsawecho/6a2e78d65f0fe67adf1b

The test machine topology is:

NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31

Changing the 3rd param from 0 to 1 at line 135 makes the LLC load miss rate
jump from 0.09% to 33.45%, and the LLC store miss rate from 0.027% to 50.695%.

Clearly the root cause is transactions crossing the node boundary.

But then how to resolve this problem is another topic...
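
For what it's worth, one mitigation (just a sketch, assuming the distributor in
the test is created with rte_distributor_create and that the workload fits on
one node; the helper name below is made up) is to create the distributor on the
local socket and restrict the coremask to that node, e.g. -c ff for node0 on
this box:

#include <rte_distributor.h>
#include <rte_lcore.h>

/* Sketch: create the distributor on the calling lcore's NUMA node so
 * that its internal buffers end up in node-local memory. */
static struct rte_distributor *
create_local_distributor(const char *name, unsigned num_workers)
{
        return rte_distributor_create(name, rte_socket_id(), num_workers);
}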

thx &
rgds,
-ql



On Tue, Nov 11, 2014 at 5:37 PM, jigsaw <jigsaw at gmail.com> wrote:

> Hi Bruce,
>
> I noticed that librte_distributor has quite a severe LLC miss problem when
> running on 16 cores, while on 8 cores there's no such problem.
> The test runs on an Intel(R) Xeon(R) CPU E5-2670, a Sandy Bridge with 32
> logical cores on 2 sockets.
>
> The test case is distributor_perf_autotest, i.e. the one in
> app/test/test_distributor_perf.c.
> The test results are collected with the command:
>
> perf stat -e LLC-load-misses,LLC-loads,LLC-store-misses,LLC-stores ./test
> -cff -n2 --no-huge
>
> Note that the test results show that the LLC miss rate remains the same
> with or without hugepages, so I will just show the --no-huge config.
>
> With 8 cores, the LLC miss rate is OK:
>
> LLC-load-misses  26750
> LLC-loads  93979233
> LLC-store-misses  432263
> LLC-stores  69954746
>
> That is a 0.028% load miss rate and a 0.62% store miss rate.
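> (i.e. LLC-load-misses / LLC-loads = 26750 / 93979233 = 0.028%, and
> LLC-store-misses / LLC-stores = 432263 / 69954746 = 0.62%.)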
>
> With 16 cores, the LLC miss rate is very high:
>
> LLC-load-misses  70263520
> LLC-loads  143807657
> LLC-store-misses  23115990
> LLC-stores  63692854
>
> That is a 48.9% load miss rate and a 36.3% store miss rate.
>
> Most of the load misses happen at the first line of rte_distributor_poll_pkt.
> I don't know where most of the store misses happen, because perf record on
> LLC-store-misses brings down my machine.
>
> It's not obvious to me how this could happen: 8 cores are fine, but
> 16 cores are very bad.
> My guess is that 16 cores bring in more QPI transactions between sockets,
> or that 16 cores produce a different LLC access pattern.
>
> So I tried to reduce the padding inside union rte_distributor_buffer from
> 3 cachelines to 1 cacheline.
>
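> (For reference, the union in lib/librte_distributor/rte_distributor.c looks
> roughly like this -- quoting from memory, so treat it as approximate:
>
> union rte_distributor_buffer {
>         volatile int64_t bufptr64;
>         char pad[CACHE_LINE_SIZE*3];
> } __rte_cache_aligned;
>
> i.e. one 64-bit word per worker, padded out to avoid false sharing.)
>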
> -     char pad[CACHE_LINE_SIZE*3];
> +    char pad[CACHE_LINE_SIZE];
>
> And it does have an obvious result:
>
> LLC-load-misses  53159968
> LLC-loads  167756282
> LLC-store-misses  29012799
> LLC-stores  63352541
>
> Now it is a 31.69% load miss rate and a 45.79% store miss rate.
>
> It lowers the load miss rate but raises the store miss rate.
> Both numbers are still very high, sadly.
> But the bright side is that it decreases the time per burst and the time
> per packet.
>
> The original version has:
> === Performance test of distributor ===
> Time per burst:  8013
> Time per packet: 250
>
> And the patched ver has:
> === Performance test of distributor ===
> Time per burst:  6834
> Time per packet: 213
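>
> (In both runs the time per packet is simply the time per burst divided by
> what looks like a 32-packet burst: 8013 / 32 = 250 and 6834 / 32 = 213, in
> integer arithmetic.)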
>
>
> I tried a couple of other tricks, such as adding more idle loops in
> rte_distributor_get_pkt, and making the rte_distributor_buffer local to
> each worker core. None of these tricks had any noticeable effect. These
> failures make me tend to believe that the high LLC miss rate is related
> to QPI or NUMA, but my machine is not able to perf on uncore QPI events,
> so this cannot be confirmed.
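>
> Roughly the kind of change I mean by "local to each worker core" -- just a
> sketch, not a tested patch, and the helper below is hypothetical -- would be
> to allocate each worker's buffer on that worker's own socket inside the
> library, e.g.:
>
> #include <rte_lcore.h>
> #include <rte_malloc.h>
> #include <rte_memory.h>
>
> /* Sketch: one zeroed, cache-aligned buffer per worker, allocated on
>  * the worker's own NUMA node instead of on the distributor's node. */
> static union rte_distributor_buffer *
> alloc_worker_buf(unsigned worker_lcore)
> {
>         return rte_zmalloc_socket("dist_buf",
>                         sizeof(union rte_distributor_buffer),
>                         CACHE_LINE_SIZE,
>                         rte_lcore_to_socket_id(worker_lcore));
> }
>
> Of course this only moves the remote access from the worker side to the
> distributor side, so it is not obviously a win.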
>
>
> I cannot draw any conclusion or pin down the root cause after all, but I
> suggest further study of this performance bottleneck so as to find a good
> solution.
>
> thx &
> rgds,
> -qinglai
>
>

