[dpdk-dev] Why packet replication is more efficient when done using memcpy() as compared to rte_mbuf_refcnt_update() function?

Ananyev, Konstantin konstantin.ananyev at intel.com
Fri Apr 20 12:05:45 CEST 2018



> -----Original Message-----
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Shailja Pandey
> Sent: Thursday, April 19, 2018 3:30 PM
> To: Wiles, Keith <keith.wiles at intel.com>
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] Why packet replication is more efficient when done using memcpy( ) as compared to rte_mbuf_refcnt_update()
> function?
> 
> > The two code fragments work in two different ways: the first uses a loop to create possibly more than one replica and the
> > second one does not, correct? The loop can cause a performance hit, but it should be small.
> Sorry for the confusion; for the memcpy version we also use a loop,
> just outside of this function. Essentially, we make the same number
> of copies in both cases.
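>
> For clarity, a minimal sketch of the two replication schemes being
> compared (simplified pseudo-code in the spirit of the earlier mail;
> error handling omitted and single-segment packets assumed):
>
>     /* (a) deep copy: allocate a fresh mbuf and copy the payload */
>     struct rte_mbuf *copy = rte_pktmbuf_alloc(mbuf_pool);
>     copy->data_len = pkt->data_len;
>     copy->pkt_len  = pkt->pkt_len;
>     rte_memcpy(rte_pktmbuf_mtod(copy, void *),
>                rte_pktmbuf_mtod(pkt, const void *),
>                pkt->data_len);
>
>     /* (b) shallow copy: rte_pktmbuf_clone() allocates an indirect
>      * mbuf pointing at the original's data and bumps the original's
>      * reference count via rte_mbuf_refcnt_update() */
>     struct rte_mbuf *clone = rte_pktmbuf_clone(pkt, mbuf_pool);
>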
> > The first one uses the hdr->next pointer, which is in the second cacheline of the mbuf header; this can and will cause a
> > cacheline miss and degrade your performance. The second code does not touch hdr->next and will not cause a cacheline miss.
> > When the packet goes beyond 64 bytes you also hit a second cacheline, so you can start to see the problem here.
> We also performed the same experiment for different packet sizes
> (64B, 128B, 256B, 512B, 1024B, 1518B); the sharp drop in throughput
> is observed only when the packet size increases from 64B to 128B,
> and not after that. A cacheline miss should happen for the other
> packet sizes as well, so I am not sure why this is the case. Why is
> the drop not sharp beyond 128B packets when replicating with
> rte_pktmbuf_refcnt_update()?
> 
> >   Every time you touch a new cacheline, performance will drop unless the cacheline is prefetched first, but in this case it
> > really cannot be done easily. Count the cachelines you are touching and make sure the number is the same in each case.
> I don't understand the complexity here; could you please explain it
> in detail?
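>
> If I understand the suggestion, prefetching the mbuf's second
> cacheline ahead of time would look roughly like this (a sketch,
> assuming rte_mbuf_prefetch_part2() from rte_mbuf.h; replicate() is
> a placeholder for the per-packet work):
>
>     for (i = 0; i < nb_pkts; i++) {
>         /* prefetch the next mbuf's second cacheline (the one
>          * holding m->next and m->pool) while handling this one */
>         if (i + 1 < nb_pkts)
>             rte_mbuf_prefetch_part2(pkts[i + 1]);
>         replicate(pkts[i]);
>     }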
> >
> > Why did you use memcpy and not rte_memcpy here, as rte_memcpy should be faster?
> >
> > I believe DPDK now has a rte_pktmbuf_alloc_bulk() function to reduce the number of rte_pktmbuf_alloc() calls, which should
> > help if you know the number of packets you need to replicate up front.
> We are already using both of these functions; I used memcpy and
> rte_pktmbuf_alloc() in the pseudo-code just to simplify it.
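>
> For reference, the bulk allocation in the real code looks roughly
> like this (a sketch; mbuf_pool, clones and BURST_SIZE are
> placeholders):
>
>     struct rte_mbuf *clones[BURST_SIZE];
>
>     /* rte_pktmbuf_alloc_bulk() returns 0 on success and fills the
>      * array with BURST_SIZE freshly allocated mbufs */
>     if (rte_pktmbuf_alloc_bulk(mbuf_pool, clones, BURST_SIZE) != 0)
>         return; /* pool exhausted */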
> 
> #                   memcpy     refcnt
> # pktsz 1 (64B)    5949888    5806720
> # pktsz 2 (128B)   5831360    2890816
> # pktsz 3 (256B)   5640379    2886016
> # pktsz 4 (512B)   5107840    2863264
> # pktsz 5 (1024B)  4510121    2692876
> 
> Throughput is in packets per second (pps).
> 

Which NIC and TX function do you use?
Any chance that multi-seg support is not on?
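
With the offloads API, requesting multi-seg support at configure time
looks roughly like this (a sketch; the exact flag depends on your
DPDK version, and on older releases the equivalent is clearing
ETH_TXQ_FLAGS_NOMULTSEGS in the per-queue txq_flags):

    struct rte_eth_conf port_conf = { 0 };

    /* ask the PMD for a TX path that handles multi-segment mbufs */
    port_conf.txmode.offloads |= DEV_TX_OFFLOAD_MULTI_SEGS;
    rte_eth_dev_configure(port_id, 1, 1, &port_conf);
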
Konstantin

