[dpdk-dev] LLC miss in librte_distributor

jigsaw jigsaw at gmail.com
Thu Nov 13 16:17:41 CET 2014


Hi,

Well, I give up the idea of optimizing QPI-caused LLC misses.
The queue-based messaging has even worse performance than polling the same
buf from both cores.
It is the nature of the busy-polling model.
I guess we have to accept it as a fact, unless the programming model can be
changed to a biased-locking model, which favors one lock-owner core.
But unfortunately the biased-locking model doesn't seem to be applicable
to the distributor.
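To illustrate what I mean by the nature of busy polling, here is a minimal
sketch (not the distributor's actual code; the names are mine):

#include <stdint.h>

#define CACHE_LINE_SIZE 64

/* one shared word per worker; the distributor stores into it and the
 * worker busy-polls it */
struct handoff {
    volatile int64_t bufptr64;
} __attribute__((aligned(CACHE_LINE_SIZE)));

/* Worker side: every time the distributor (possibly on the other socket)
 * writes bufptr64, the worker's next load misses and the cache line
 * migrates across QPI. One coherence transfer per handoff is inherent to
 * this model, no matter which core "owns" the line. */
static inline int64_t
worker_poll(struct handoff *h, int64_t old)
{
    int64_t v;
    while ((v = h->bufptr64) == old)
        ; /* spin until the distributor publishes a new value */
    return v;
}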

thx &
rgds,
-ql

On Wed, Nov 12, 2014 at 7:11 PM, jigsaw <jigsaw at gmail.com> wrote:

> Hi Bruce,
>
> Thanks for your reply.
>
> I agree that to logically divide the distributor functionality is the best
> solution.
>
> Meantime I tried some tricks and the results look good: for the same amount
> of pkts (1M), the LLC stores and loads decrease by 90%, and the miss rates
> for both decrease to 25%.
> The L1 miss rate increases a bit, though.
> The combined result is that the time spent decreases by 50%.
> The main change I made is to use a FIFO to transfer the pkts from the
> distributor to the workers, while the current buf is only used as a
> signalling channel. This change has a very obvious effect on saving LLC
> accesses. A rough sketch of the idea follows.
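>
> This is roughly what I mean, assuming one SP/SC rte_ring per worker as the
> FIFO (names like pkt_fifo are mine, and the signalling protocol is
> simplified here; this is not the actual patch):
>
> #include <rte_ring.h>
> #include <rte_mbuf.h>
>
> /* hypothetical per-worker state: a ring carrying the mbuf pointers,
>  * plus the existing shared buf demoted to a "data available" signal */
> struct worker_channel {
>     struct rte_ring *pkt_fifo;  /* created with RING_F_SP_ENQ|RING_F_SC_DEQ */
>     volatile int64_t signal;    /* the old buf, now signalling only */
> };
>
> /* distributor side: the pointer goes through the FIFO, the shared line
>  * only toggles, so the worker's polling loop stops dragging mbuf
>  * pointers across the LLC */
> static inline void
> dist_send(struct worker_channel *ch, struct rte_mbuf *m)
> {
>     if (rte_ring_sp_enqueue(ch->pkt_fifo, m) == 0)
>         ch->signal = 1;
> }
>
> /* worker side: wait on the signal, then drain from the FIFO; a real
>  * implementation must handle the signal/dequeue races properly */
> static inline struct rte_mbuf *
> worker_recv(struct worker_channel *ch)
> {
>     struct rte_mbuf *m;
>     while (ch->signal == 0)
>         ; /* busy poll on the signal word only */
>     if (rte_ring_sc_dequeue(ch->pkt_fifo, (void **)&m) < 0)
>         return NULL;
>     ch->signal = 0;
>     return m;
> }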
>
> However, the test is based on the simple test program rather than on a DPDK
> application, so I will try the same tricks on DPDK and see if they have the
> same effect.
> Besides, I need more time to read a few more papers to get it right.
> I will try to propose a patch if I manage to get a positive result. It
> will take several days since I'm not fully dedicated to this issue.
>
> I will come back with more details.
>
> BTW, I have another user story: a worker can ask the distributor to
> reschedule a pkt.
> It arises in the following condition: after processing a pkt with tag value
> 1, the worker changes its tag to 2, so the distributor has to be asked to
> deliver the pkt with the new tag value to the proper worker.
> I already have the patch ready, but I will hold it back until the previous
> patch is committed.
> I would also like your comments on this user story.
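>
> To make the user story concrete, the worker-side flow I have in mind is
> roughly as follows. Note rte_distributor_reschedule_pkt is a made-up name
> (the patch may expose this differently), and I assume the flow tag is
> taken from the mbuf's hash field, as rte_distributor_process does:
>
> struct rte_mbuf *pkt = rte_distributor_get_pkt(d, worker_id, oldpkt);
>
> handle_stage_one(pkt);   /* work done under tag value 1 */
> pkt->hash.usr = 2;       /* retag for the next stage; field name
>                           * depends on the DPDK mbuf version */
>
> /* hand the pkt back so the distributor can deliver it to whichever
>  * worker currently holds tag 2 */
> rte_distributor_reschedule_pkt(d, worker_id, pkt);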
>
> thx &
> rgds,
> -ql
>
> On Wed, Nov 12, 2014 at 6:07 PM, Bruce Richardson <
> bruce.richardson at intel.com> wrote:
>
>> On Wed, Nov 12, 2014 at 10:37:33AM +0200, jigsaw wrote:
>> > Hi,
>> >
>> > OK it is now very clear it is due to memory transactions between
>> > different nodes.
>> >
>> > The test program is here:
>> > https://gist.github.com/jigsawecho/6a2e78d65f0fe67adf1b
>> >
>> > The test machine topology is:
>> >
>> > NUMA node0 CPU(s):     0-7,16-23
>> > NUMA node1 CPU(s):     8-15,24-31
>> >
>> > Change the 3rd param from 0 to 1 at line 135, and the LLC load miss
>> > rate jumps from 0.09% to 33.45%.
>> > The LLC store miss rate jumps from 0.027% to 50.695%.
>> >
>> > Clearly the root cause is transaction crossing the node boundary.
>> >
>> > But then how to resolve this problem is another topic...
>> >
>> > thx &
>> > rgds,
>> > -ql
>> >
>> >
>>
>> Having traffic cross QPI is always a problem, and there could be a number
>> of ways to solve it. Probably the best solution is to have multiple NICs,
>> with some directly connected to each socket, and the packets from each NIC
>> processed locally on the socket that the NIC is connected to.
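>>
>> A minimal sketch of that port-to-socket affinity check
>> (handle_port_locally is a placeholder for the app's local RX path):
>>
>> #include <rte_ethdev.h>
>> #include <rte_lcore.h>
>>
>> static void handle_port_locally(uint8_t port); /* app-defined */
>>
>> /* On each lcore, only poll the ports that sit on this lcore's own
>>  * NUMA node, so RX and processing never cross QPI. */
>> static void
>> poll_local_ports(void)
>> {
>>     unsigned socket = rte_lcore_to_socket_id(rte_lcore_id());
>>     uint8_t port;
>>
>>     for (port = 0; port < rte_eth_dev_count(); port++) {
>>         /* rte_eth_dev_socket_id() reports the NUMA node the NIC
>>          * is attached to */
>>         if (rte_eth_dev_socket_id(port) == (int)socket)
>>             handle_port_locally(port);
>>     }
>> }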
>>
>> If that is not possible, then other solutions need to be looked at. E.g.
>> for an app wanting to use a distributor, I would suggest investigating
>> whether two distributors could be used - one on each socket. Then use a
>> ring to burst-transfer large groups of packets from one socket to the
>> other, and then use the distributor locally. This would involve far less
>> QPI traffic than using a distributor with remote workers.
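>>
>> In code, the hand-over could look something like this (xsock_ring,
>> mbufs/nb_remote and dist_socket1 are illustrative names assumed to exist
>> in the surrounding app):
>>
>> #define XSOCK_BURST 32
>>
>> /* socket 0 side: batch remote-bound packets through one ring, so the
>>  * QPI cost is paid once per burst instead of once per packet; a real
>>  * app must also handle packets the ring couldn't take */
>> rte_ring_sp_enqueue_burst(xsock_ring, (void **)mbufs, nb_remote);
>>
>> /* socket 1 side: drain the ring and distribute to local workers only */
>> struct rte_mbuf *burst[XSOCK_BURST];
>> unsigned nb = rte_ring_sc_dequeue_burst(xsock_ring, (void **)burst,
>>                                         XSOCK_BURST);
>> if (nb > 0)
>>     rte_distributor_process(dist_socket1, burst, nb);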
>>
>> Regards,
>> /Bruce
>>
>> >
>> > On Tue, Nov 11, 2014 at 5:37 PM, jigsaw <jigsaw at gmail.com> wrote:
>> >
>> > > Hi Bruce,
>> > >
>> > > I noticed that librte_distributor has quite a severe LLC miss problem
>> > > when running on 16 cores, while on 8 cores there's no such problem.
>> > > The test runs on an Intel(R) Xeon(R) CPU E5-2670, a Sandy Bridge with
>> > > 32 logical cores on 2 sockets.
>> > >
>> > > The test case is the distributor_perf_autotest, i.e.
>> > > in app/test/test_distributor_perf.c.
>> > > The test result is collected by command:
>> > >
>> > > perf stat -e LLC-load-misses,LLC-loads,LLC-store-misses,LLC-stores \
>> > >     ./test -cff -n2 --no-huge
>> > >
>> > > Note that the test results show that with or without hugepages, the
>> > > LLC miss rate remains the same, so I will just show the --no-huge
>> > > config.
>> > >
>> > > With 8 cores, the LLC miss rate is OK:
>> > >
>> > > LLC-load-misses  26750
>> > > LLC-loads  93979233
>> > > LLC-store-misses  432263
>> > > LLC-stores  69954746
>> > >
>> > > That is a 0.028% load miss rate and a 0.62% store miss rate.
>> > >
>> > > With 16 cores, the LLC miss rate is very high:
>> > >
>> > > LLC-load-misses  70263520
>> > > LLC-loads  143807657
>> > > LLC-store-misses  23115990
>> > > LLC-stores  63692854
>> > >
>> > > That is a 48.9% load miss rate and a 36.3% store miss rate.
>> > >
>> > > Most of the load misses happen at the first line of
>> > > rte_distributor_poll_pkt.
>> > > Most of the store misses happen at ... I don't know, because perf
>> > > record on LLC-store-misses brings down my machine.
>> > >
>> > > It's not so straightforward to me how this could happen: 8 cores are
>> > > fine, but 16 cores are very bad.
>> > > My guess is that 16 cores bring in more QPI transactions between
>> > > sockets? Or that 16 cores bring a different LLC access pattern?
>> > >
>> > > So I tried to reduce the padding inside union rte_distributor_buffer
>> > > from 3 cachelines to 1 cacheline.
>> > >
>> > > -     char pad[CACHE_LINE_SIZE*3];
>> > > +    char pad[CACHE_LINE_SIZE];
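>> > >
>> > > For reference, the union being padded looks roughly like this in the
>> > > library (quoted from memory, so the exact definition may differ):
>> > >
>> > > union rte_distributor_buffer {
>> > >     volatile int64_t bufptr64;    /* pkt pointer plus flag bits */
>> > >     char pad[CACHE_LINE_SIZE*3];  /* 3 cachelines -> 1 in my test */
>> > > } __rte_cache_aligned;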
>> > >
>> > > And it does have an obvious result:
>> > >
>> > > LLC-load-misses  53159968
>> > > LLC-loads  167756282
>> > > LLC-store-misses  29012799
>> > > LLC-stores  63352541
>> > >
>> > > Now it is a 31.69% load miss rate and a 45.79% store miss rate.
>> > >
>> > > It lowers the load miss rate but raises the store miss rate.
>> > > Both numbers are still very high, sadly.
>> > > But the bright side is that it decreases the time per burst and the
>> > > time per packet.
>> > >
>> > > The original version has:
>> > > === Performance test of distributor ===
>> > > Time per burst:  8013
>> > > Time per packet: 250
>> > >
>> > > And the patched ver has:
>> > > === Performance test of distributor ===
>> > > Time per burst:  6834
>> > > Time per packet: 213
>> > >
>> > >
>> > > I tried a couple of other tricks, such as adding more idle loops
>> > > in rte_distributor_get_pkt, and making the rte_distributor_buffer
>> > > thread-local to each worker core. But none of these tricks had any
>> > > noticeable outcome. These failures make me tend to believe that the
>> > > high LLC miss rate is related to QPI or NUMA, but my machine is not
>> > > able to perf on uncore QPI events, so this cannot be proved.
>> > >
>> > >
>> > > I cannot draw any conclusion or reveal the root cause after all, but I
>> > > suggest a further study of the performance bottleneck so as to find a
>> > > good solution.
>> > >
>> > > thx &
>> > > rgds,
>> > > -qinglai
>> > >
>> > >
>>
>
>

