[dpdk-dev] Why packet replication is more efficient when done using memcpy() as compared to rte_mbuf_refcnt_update() function?
Ananyev, Konstantin
konstantin.ananyev at intel.com
Fri Apr 20 12:05:45 CEST 2018
> -----Original Message-----
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Shailja Pandey
> Sent: Thursday, April 19, 2018 3:30 PM
> To: Wiles, Keith <keith.wiles at intel.com>
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] Why packet replication is more efficient when done using memcpy( ) as compared to rte_mbuf_refcnt_update()
> function?
>
> > The two code fragments work in two different ways: the first uses a loop to create possibly more than one replica and the second
> one does not, correct? The loop can cause a performance hit, but it should be small.
> Sorry for the confusion: for the memcpy version we also use a loop
> outside of this function. Essentially, we are making the same number of
> copies in both cases.
> > The first one uses the hdr->next pointer, which is in the second cacheline of the mbuf header; this can and will cause a cacheline miss
> and degrade your performance. The second code does not touch hdr->next and will not cause a cacheline miss. When the packet goes
> beyond 64 bytes you hit the second cacheline; are you starting to see the problem here?
> We also performed the same experiment for different packet sizes (64B, 128B,
> 256B, 512B, 1024B, 1518B); the sharp drop in throughput is observed only
> when the packet size increases from 64B to 128B, and not after that. So a
> cacheline miss should happen for the other packet sizes as well. I am not sure
> why this is the case. Why is the drop not sharp after 128B packets when
> replicating using rte_pktmbuf_refcnt_update()?
>
> > Every time you touch a new cacheline performance will drop unless the cacheline is prefetched first, but in this case it
> really cannot be done easily. Count the cachelines you are touching and make sure they are the same number in each case.
> I don't understand the complexity here; could you please explain it in
> more detail?
> >
> > Why did you use memcpy and not rte_memcpy here as rte_memcpy should be faster?
> >
> > I believe now DPDK has a rte_pktmbuf_alloc_bulk() function to reduce the number of rte_pktmbuf_alloc() calls, which should help if you
> know the number of packets you need to replicate up front.
> We are already using both of these functions, just to simplify the
> pseudo-code I used memcpy and rte_pktmbuf_alloc().
>
> #                 | memcpy  | refcnt  |
> pktsz 1 (64B)     | 5949888 | 5806720 |
> pktsz 2 (128B)    | 5831360 | 2890816 |
> pktsz 3 (256B)    | 5640379 | 2886016 |
> pktsz 4 (512B)    | 5107840 | 2863264 |
> pktsz 5 (1024B)   | 4510121 | 2692876 |
>
> Throughput is in packets per second.
>
Wonder what NIC and TX function you use?
Any chance that multi-seg support is not on?
Konstantin