[dpdk-dev] [PATCH] vhost: improve dirty pages logging performance

Michael S. Tsirkin mst at redhat.com
Fri May 4 20:54:17 CEST 2018


On Fri, May 04, 2018 at 05:48:05PM +0200, Maxime Coquelin wrote:
> Hi Tiwei,
> 
> On 05/03/2018 01:56 PM, Tiwei Bie wrote:
> > On Mon, Apr 30, 2018 at 05:59:54PM +0200, Maxime Coquelin wrote:
> > > This patch caches all dirty pages logging until the used ring index
> > > is updated. These dirty pages won't be accessed by the guest as
> > > long as the host doesn't give them back to it by updating the
> > > index.
> > 
> > Below sentence in above commit message isn't the reason why
> > we can cache the dirty page logging. Right?
> > 
> > """
> > These dirty pages won't be accessed by the guest as
> > long as the host doesn't give them back to it by updating the
> > index.
> > """
> > 
> > > 
> > > The goal of this optimization is to fix a performance regression
> > > introduced when the vhost library started to use atomic operations
> > > to set bits in the shared dirty log map. While the fix was valid
> > > as previous implementation wasn't safe against concurent accesses,
> > > contention was induced.
> > > 
> > > With this patch, during migration, we have:
> > > 1. Less atomic operations as only a single atomic OR operation
> > > per 32 pages.
> > 
> > Why not do it per 64 pages?
> 
> I wasn't sure it would be supported on 32bits CPU, but [0] seems do
> indicate that only 128bits atomic operations aren't supported on all
> architectures.
> 
> I will change to do it per 64 pages.

""
The four non-arithmetic functions (load, store, exchange, and
compare_exchange) all have a generic version as well. This generic
version works on any data type. It uses the lock-free built-in function
if the specific data type size makes that possible; otherwise, an
external call is left to be resolved at run time. This external call is
the same format with the addition of a ‘size_t’ parameter inserted as
the first parameter indicating the size of the object being pointed to.
All objects must be the same size.
""

So I suspect it isn't well supported on all 32 bit CPUs, but
can't you do an atomic using long? That will do the right
thing depending on the CPU.

If not it's pretty easy to something along the lines of
#if BIT_PER_LONG == 64

> [0]: https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html#g_t_005f_005fatomic-Builtins
> > > 2. Less atomic operations as during a burst, the same page will
> > > be marked dirty only once.
> > > 3. Less write memory barriers.
> > > 
> > > Fixes: 897f13a1f726 ("vhost: make page logging atomic")
> > > 
> > > Cc: stable at dpdk.org
> > > 
> > > Suggested-by: Michael S. Tsirkin <mst at redhat.com>
> > > Signed-off-by: Maxime Coquelin <maxime.coquelin at redhat.com>
> > > ---
> > > 
> > > Hi,
> > > 
> > > This series was tested with migrating a guest while running PVP
> > > benchmark at 1Mpps with both ovs-dpdk and testpmd as vswitch.
> > 
> > If the throughput is higher (e.g. by adding more cores
> > and queues), will the live migration fail due to the
> > higher dirty page generating speed?
> 
> I haven't done the check, but I agree that the higher is the throughput,
> the longer will be the migration duration and the higher the risk it
> never converges.

Right. It's not a good reason to just randomly slow down networking :)
In fact if data ends up on the same page, there is a chance that instead
of dirtying the same page multiple times we will dirty it once, which
will help migration converge.

> In this case of scenario, postcopy live-migration way be a better fit,
> as the hot pages will be copied only once. Postcopy support for
> vhost-user have been added to QEMU in v2.12, and I have a prototype for
> DPDK that I plan to propose for next release.
> 
> > 
> > > 
> > > With this patch we recover the packet drops regressions seen since
> > > the use of atomic operations to log dirty pages.
> > [...]
> > > +static __rte_always_inline void
> > > +vhost_log_cache_sync(struct virtio_net *dev, struct vhost_virtqueue *vq)
> > > +{
> > > +	uint32_t *log_base;
> > > +	int i;
> > > +
> > > +	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> > > +		   !dev->log_base))
> > > +		return;
> > > +
> > > +	log_base = (uint32_t *)(uintptr_t)dev->log_base;
> > > +
> > > +	/* To make sure guest memory updates are committed before logging */
> > > +	rte_smp_wmb();
> > 
> > It seems that __sync_fetch_and_or() can be considered a full
> > barrier [1]. So do we really need this rte_smp_wmb()?
> 
> That's a good point, thanks for the pointer.
> 
> > Besides, based on the same doc [1], it seems that the __sync_
> > version is deprecated in favor of the __atomic_ one.
> 
> I will change to __atomic_. For the memory model, do you agree I should
> use __ATOMIC_SEQ_CST?
> 
> > [1] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html
> > 
> > > +
> > > +	for (i = 0; i < vq->log_cache_nb_elem; i++) {
> > > +		struct log_cache_entry *elem = vq->log_cache + i;
> > > +
> > > +		__sync_fetch_and_or(log_base + elem->offset, elem->val);
> > > +	}
> > > +
> > > +	vq->log_cache_nb_elem = 0;
> > > +}
> > > +
> > [...]
> > 
> 
> Thanks,
> Maxime


More information about the dev mailing list