rebuilding replicas causes all background tasks to stall (no differentiation between latency sensitive tasks and others)

Bug #622670 reported by Данило Шеган
This bug affects 1 person
Affects: Launchpad itself
Status: Fix Released
Importance: High
Assigned to: Unassigned
Milestone: none listed

Bug Description

We use DBLoopTuner for all of our data model migration scripts. However, when a single slave in a replication set is being rebuilt, lag across the entire cluster skyrockets and any migration job we are running stalls for ~12h (the time it takes to fully rebuild a slave).

While this is bad for migration scripts, it makes DBLoopTuner unusable for regular cronscripts. I've even had a case where using DBLoopTuner makes the script underperform: cronscripts/rosetta-pofile-stats-daily.py takes ages to finish even though it processes a relatively small number of rows (bug 622668).

Solutions
=========

* Allow DBLoopTuner to be configured to ignore lag, for things that should, well, ignore lag.

Stuart Bishop (stub) wrote :

This is by design. When a slave is being rebuilt, a transaction is kept open for several hours. We don't want data changes being made during this window because they cause database bloat, so we don't want scripts that do bulk changes running while long-running transactions are open.

If it is critical that things never block, don't use DBLoopTuner. Its purpose is to block when the DB is unhappy. Use LoopTuner instead.

If we want things to block, but not so aggressively, it needs to be opt-in. We could either add flags to DBLoopTuner to skip some checks, or add alternative implementations that strike a different balance between remaining responsive and keeping the database happy.

Stuart Bishop (stub) wrote :

We can do this right now by setting long_running_transaction to None.

Changed in launchpad-foundations:
status: New → Invalid
Stuart Bishop (stub)
Changed in launchpad-foundations:
status: Invalid → Confirmed
importance: Undecided → Medium
Данило Шеган (danilo) wrote :

As we discussed on IRC, even setting long_running_transaction to None won't solve everything because we are still reading the cluster lag. When a slave is being rebuilt, the lag is going to grow out of hand as well, and any scripts running under DBLoopTuner will block. Re-opening.

Changed in launchpad-foundations:
status: Confirmed → New
status: New → Confirmed
Stuart Bishop (stub) wrote :

So long_running_transaction can be used to tune or turn off the long running transaction checks.

acceptable_replication_lag can be used to tune the cluster lag checks.

If we want to avoid blocking on a new node rebuild, we need to change the lag check from cluster lag to (all nodes in the cluster that are not currently being built) lag.
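A minimal sketch of that alternative lag check, not Launchpad code. The function name `effective_lag`, the `node_lags` mapping and the `rebuilding` set are all hypothetical, and the sketch assumes something outside the replication system keeps track of which nodes are currently being built.

```python
from datetime import timedelta


def effective_lag(node_lags, rebuilding):
    """Return the worst lag among nodes that are not being rebuilt.

    node_lags:  dict mapping a node name to its lag (timedelta).
    rebuilding: set of node names currently being built. How this set is
                maintained is an assumption of this sketch.
    """
    lags = [lag for node, lag in node_lags.items() if node not in rebuilding]
    # With no eligible nodes there is nothing to wait for.
    return max(lags, default=timedelta(0))
```

A loop tuner could then compare `effective_lag(...)` against acceptable_replication_lag instead of comparing the raw cluster lag.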

Curtis Hovey (sinzui)
Changed in launchpad:
status: Confirmed → Triaged
Stuart Bishop (stub) wrote :

With Slony, there is no way of telling which nodes are currently being built. We would have to store this state somewhere and ensure it is updated somehow. This seems overkill, as we normally only build a new node once per month.

Changed in launchpad:
status: Triaged → Won't Fix
Robert Collins (lifeless) wrote : Re: rebuilding replicas causes all background tasks to stall

I think it's worth figuring out and doing because, even though it's monthly, it is very disruptive; in particular it makes opening new distroreleases just after a db deploy a problem, which we don't need.

Changed in launchpad:
status: Won't Fix → Triaged
importance: Medium → High
summary: - DBLoopTuner stalls over DB slave rebuilds
+ rebuilding replicas causes all background tasks to stall
Stuart Bishop (stub) wrote :

Danilo suggested allowing things to continue if lag is 'too high'. I've thought about this, and it seems fine as the only time lag is 'too high' is when alerts have been raised and people are having to fix things, or we are in a known laggy situation such as rebuilding replicas.

So I think we can change the loop tuner to only block if replication lag is between 5 minutes and 45 minutes.
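The windowed check could look roughly like this. It is a sketch only: `should_block` and the two constants are hypothetical names, and treating both boundaries as inclusive is an assumption.

```python
from datetime import timedelta

# Thresholds taken from the comment above; the names are hypothetical.
MIN_BLOCKING_LAG = timedelta(minutes=5)
MAX_BLOCKING_LAG = timedelta(minutes=45)


def should_block(lag, low=MIN_BLOCKING_LAG, high=MAX_BLOCKING_LAG):
    """Block only while lag sits inside the [low, high] window.

    Below `low` the cluster is healthy enough to proceed. Above `high`
    we assume a known laggy situation (alerts raised, or a replica
    rebuild) and let the work continue rather than stall for hours.
    """
    return low <= lag <= high
```

For example, `should_block(timedelta(minutes=10))` is true, while `should_block(timedelta(hours=6))` is false, so a script would keep running through a long rebuild.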

Robert Collins (lifeless) wrote :

So, we no longer rebuild replicas every month; we only do so when we add nodes, which is very rare.

That said, I think we can discriminate between 'big scanning jobs that should respect lag' and 'lively maintenance' that should ignore lag.
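One way to express that distinction is a per-job opt-out. The sketch below is an illustration, not the fix that was released: `wait_until_db_ready`, its parameters and the 30-second default are all hypothetical.

```python
import time
from datetime import timedelta


def wait_until_db_ready(get_lag, respect_lag,
                        acceptable_lag=timedelta(seconds=30),
                        poll_seconds=10, sleep=time.sleep):
    """Block while lag is too high, unless the job opted out.

    get_lag:     callable returning the current lag as a timedelta.
    respect_lag: True for big scanning jobs, False for lively
                 maintenance that should ignore lag.
    Returns the number of times the caller was put to sleep.
    """
    if not respect_lag:
        return 0
    waits = 0
    while get_lag() > acceptable_lag:
        sleep(poll_seconds)
        waits += 1
    return waits
```

A bulk migration would call this with `respect_lag=True` before each chunk; a latency-sensitive cronscript would pass `respect_lag=False` and never stall on a replica rebuild.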

description: updated
summary: - rebuilding replicas causes all background tasks to stall
+ rebuilding replicas causes all background tasks to stall (no
+ differentiation between latency sensitive tasks and others)
William Grant (wgrant)
Changed in launchpad:
status: Triaged → Fix Released