[dpdk-dev] LLC miss in librte_distributor
jigsaw
jigsaw at gmail.com
Tue Nov 11 16:37:52 CET 2014
Hi Bruce,
I noticed that librte_distributor has a quite severe LLC miss problem when
running on 16 cores.
On 8 cores there is no such problem.
The test runs on an Intel(R) Xeon(R) CPU E5-2670, a Sandy Bridge with 32
cores on 2 sockets.
The test case is distributor_perf_autotest, i.e. the one
in app/test/test_distributor_perf.c.
The test results are collected with the command:
perf stat -e LLC-load-misses,LLC-loads,LLC-store-misses,LLC-stores ./test -cff -n2 --no-huge
Note that the test results show that with or without hugepages, the LLC miss
rate remains the same. So I will just show the --no-huge config.
With 8 cores, the LLC miss rate is OK:
LLC-load-misses 26750
LLC-loads 93979233
LLC-store-misses 432263
LLC-stores 69954746
That is 0.028% load miss and 0.62% store miss.
With 16 cores, the LLC miss rate is very high:
LLC-load-misses 70263520
LLC-loads 143807657
LLC-store-misses 23115990
LLC-stores 63692854
That is 48.9% load miss and 36.3% store miss.
Most of the load misses happen at the first line of rte_distributor_poll_pkt.
Most of the store misses happen at ... I don't know, because perf record on
LLC-store-misses brings down my machine.
It is not obvious to me how this could happen: 8 cores are fine, but
16 cores are very bad.
My guess is that 16 cores bring in more QPI transactions between sockets?
Or that 16 cores bring a different LLC access pattern?
So I tried to reduce the padding inside union rte_distributor_buffer from 3
cachelines to 1 cacheline.
- char pad[CACHE_LINE_SIZE*3];
+ char pad[CACHE_LINE_SIZE];
And it does have an obvious effect:
LLC-load-misses 53159968
LLC-loads 167756282
LLC-store-misses 29012799
LLC-stores 63352541
Now it is 31.69% load miss and 45.80% store miss.
It lowers the load miss rate, but raises the store miss rate.
Both numbers are still very high, sadly.
But the bright side is that it decreases the time per burst and the time per
packet.
The original version has:
=== Performance test of distributor ===
Time per burst: 8013
Time per packet: 250
And the patched version has:
=== Performance test of distributor ===
Time per burst: 6834
Time per packet: 213
I tried a couple of other tricks, such as adding more idle loops
in rte_distributor_get_pkt and making the rte_distributor_buffer
thread_local to each worker core. But none of these tricks had any
noticeable effect. These failures make me tend to believe the high LLC
miss rate is related to QPI or NUMA. But my machine is not able to perf on
uncore QPI events, so this cannot be proved.
I cannot draw any conclusion or reveal the root cause after all. But I
suggest a further study on the performance bottleneck so as to find a good
solution.
thx &
rgds,
-qinglai