rebuilding replicas causes all background tasks to stall (no differentiation between latency sensitive tasks and others)

Bug #622670 reported by Данило Шеган on 2010-08-23

This bug affects 1 person

Affects           Status        Importance  Assigned to  Milestone
Launchpad itself  Fix Released  High        Unassigned

Bug Description

We use DBLoopTuner for all of our data model migration scripts. However, when a single slave in a replication set is being rebuilt, lag across the entire cluster skyrockets and any migration job we are running stalls for roughly 12 hours (the time it takes to fully rebuild a slave).

While this is bad for migration scripts, it makes DBLoopTuner unusable for regular cronscripts. I've even had a case where a script underperforms when using DBLoopTuner: cronscripts/rosetta-pofile-stats-daily.py takes ages to finish even though it processes a relatively small number of rows (bug 622668).

Solutions
=========

* Allow DBLoopTuner to be configured to ignore lag, for things that should, well, ignore lag.

Stuart Bishop (stub) wrote on 2010-08-23:

This is by design. When a slave is being rebuilt, a transaction is kept open for several hours. We don't want data changes being made during this window because they cause database bloat, so we don't want scripts that do bulk changes running while long-running transactions are open.

If it is critical that things never block, don't use DBLoopTuner. Its purpose is to block when the DB is unhappy. Use LoopTuner instead.

If we want things to block, but not so aggressively, it needs to be opt-in. We could either add flags to DBLoopTuner to skip some checks, or add alternative implementations that strike a different balance between remaining responsive and keeping the database happy.

Stuart Bishop (stub) wrote on 2010-08-23:

We can do this right now by setting long_running_transaction to None.

Changed in launchpad-foundations:
status:	New → Invalid

Stuart Bishop (stub) on 2010-08-23

Changed in launchpad-foundations:
status:	Invalid → Confirmed
importance:	Undecided → Medium

Данило Шеган (danilo) wrote on 2010-08-23:

As we discussed on IRC, even setting long_running_transaction to None won't solve everything because we are still reading the cluster lag. When a slave is being rebuilt, the lag is going to grow out of hand as well, and any scripts running under DBLoopTuner would block. Re-opening.

Changed in launchpad-foundations:
status:	Confirmed → New
status:	New → Confirmed

Stuart Bishop (stub) wrote on 2010-08-23:

So long_running_transaction can be used to tune or turn off the long running transaction checks.

acceptable_replication_lag can be used to tune the cluster lag checks.

If we want to avoid blocking on a new node rebuild, we need to change the lag check from cluster lag to (all nodes in the cluster that are not currently being built) lag.
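
To illustrate the proposed change from cluster lag to the lag of nodes that are not being built, here is a minimal sketch. It is not DBLoopTuner code: the function name, the `node_lags` mapping, and the `rebuilding` set are all hypothetical, and the sketch simply assumes that something else supplies the set of nodes under rebuild.

```python
from datetime import timedelta


def lag_ignoring_rebuilds(node_lags, rebuilding):
    """Return the worst lag among nodes that are not being rebuilt.

    node_lags: dict of node id -> timedelta (hypothetical input shape).
    rebuilding: set of node ids currently being rebuilt (assumed to be
    tracked somewhere outside this function).
    """
    live = [lag for node, lag in node_lags.items() if node not in rebuilding]
    # With no live nodes to report, treat the lag as zero.
    return max(live, default=timedelta(0))


lags = {1: timedelta(seconds=10), 2: timedelta(hours=6)}
print(lag_ignoring_rebuilds(lags, rebuilding={2}))
```

With node 2 excluded, the lag check would see only the 10 second lag of node 1 rather than the 6 hour lag of the node being built.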

Curtis Hovey (sinzui) on 2010-12-23

Changed in launchpad:
status:	Confirmed → Triaged

Stuart Bishop (stub) wrote on 2011-02-07:

With Slony, there is no way of telling which nodes are currently being built. We would have to store this state somewhere and ensure it is updated somehow. This seems overkill, as we normally only build a new node once per month.

Changed in launchpad:
status:	Triaged → Won't Fix

Robert Collins (lifeless) wrote on 2011-05-06: Re: rebuilding replicas causes all background tasks to stall

I think it's worth figuring out and doing because, even though it's monthly, it is very disruptive; in particular it makes opening new distroreleases just after a db deploy a problem, which we don't need.

Changed in launchpad:
status:	Won't Fix → Triaged
importance:	Medium → High
summary:

- DBLoopTuner stalls over DB slave rebuilds
+ rebuilding replicas causes all background tasks to stall

Stuart Bishop (stub) wrote on 2011-05-06:

Danilo suggested allowing things to continue if lag is 'too high'. I've thought about this, and it seems fine as the only time lag is 'too high' is when alerts have been raised and people are having to fix things, or we are in a known laggy situation such as rebuilding replicas.

So I think we can change the loop tuner to only block if replication lag is between 5 minutes and 45 minutes.
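
The windowed check described here could look roughly like the sketch below. The function and constant names are hypothetical, and treating both bounds as inclusive is an assumption, since the comment does not say how the edges should behave.

```python
from datetime import timedelta

# Thresholds taken from the comment above; the names are illustrative only.
LAG_FLOOR = timedelta(minutes=5)
LAG_CEILING = timedelta(minutes=45)


def should_block(lag, floor=LAG_FLOOR, ceiling=LAG_CEILING):
    """Block only while lag sits inside the [floor, ceiling] window.

    Below the floor the cluster is considered healthy. Above the ceiling
    we assume a known laggy situation (alerts raised, or a replica
    rebuild in progress) and let the job continue.
    """
    return floor <= lag <= ceiling


for minutes in (1, 20, 600):
    print(minutes, should_block(timedelta(minutes=minutes)))
```

A job seeing 20 minutes of lag would wait, while one seeing 1 minute or 10 hours of lag would proceed.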

Robert Collins (lifeless) wrote on 2012-01-05:

So, we no longer rebuild replicas every month; we do it only when we add nodes, which is very rare.

That said, I think we can discriminate between 'big scanning jobs that should respect lag' and 'lively maintenance' that should ignore lag.
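
One way to express that distinction is a per-job policy that the lag check consults, as in the sketch below. `LagPolicy`, `must_wait`, and the 30 second default threshold are all hypothetical names and values chosen for illustration; they are not taken from the Launchpad code.

```python
from datetime import timedelta
from enum import Enum


class LagPolicy(Enum):
    RESPECT_LAG = "respect"  # big scanning jobs
    IGNORE_LAG = "ignore"    # lively maintenance


def must_wait(policy, lag, acceptable_lag=timedelta(seconds=30)):
    """Decide whether a job should pause before its next chunk of work.

    Jobs that ignore lag never wait; jobs that respect lag wait while
    the observed lag exceeds the (illustrative) acceptable threshold.
    """
    if policy is LagPolicy.IGNORE_LAG:
        return False
    return lag > acceptable_lag


observed = timedelta(hours=3)
print(must_wait(LagPolicy.RESPECT_LAG, observed))
print(must_wait(LagPolicy.IGNORE_LAG, observed))
```

Under this sketch a replica rebuild would still pause bulk scans, but latency-sensitive maintenance jobs would keep running.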

Robert Collins (lifeless) on 2012-01-05

description:

updated

Robert Collins (lifeless) on 2012-01-16

summary:

- rebuilding replicas causes all background tasks to stall
+ rebuilding replicas causes all background tasks to stall (no
+ differentiation between latency sensitive tasks and others)

William Grant (wgrant) on 2013-08-12

Changed in launchpad:
status:	Triaged → Fix Released
