Launchpad itself

upgrade robustness for cronscripts

Bug #607391 reported by Robert Collins on 2010-07-19

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Launchpad itself	Fix Released	High	Stuart Bishop	Launchpad itself 10.10

Bug Description

We want to do more full-rollouts without adding downtime except when there are actual db patches to deploy.

We have ~170 cronscript instances spread over many machines, and these are a likely source of fragility and perceived downtime.

Some possible problems:
- they run from cron, so either they try to run while we're upgrading, or they may still be running while we upgrade
- they run from the 'current' symlink, not the 'active revno dir', so they could in principle do late buggy imports (but this is rare, we can ignore for now)
- some of them have terrible knock-on effects if interrupted, and take 15-20 minutes every hour.

Specifically: the publisher script takes ages and if interrupted makes a mess, so we will want to deploy *around* it.

I have in mind a single simple wrapper that we can surround all cronscripts with that will:
- check a single well known place to determine whether to load the script or not
- (optionally) not run scripts that take more than <estimate> leading up to a rollout (e.g. 1 minute scripts might still run)
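
A minimal sketch of such a wrapper check, under stated assumptions: the "well known place" is taken to be a JSON control file with the hypothetical keys `enabled` and `rollout_at` (an epoch timestamp), and each cronscript is assumed to supply its own runtime estimate in seconds. None of these names come from Launchpad itself.

```python
import json
import time


def should_run(control_path, estimate_seconds, now=None):
    """Return True if a cronscript with this runtime estimate may start.

    control_path, "enabled" and "rollout_at" are hypothetical names
    used only for this illustration.
    """
    now = time.time() if now is None else now
    try:
        with open(control_path) as f:
            policy = json.load(f)
    except FileNotFoundError:
        # No control file: assume no rollout is planned and run as usual.
        return True
    if not policy.get("enabled", True):
        # Cronscripts have been switched off centrally.
        return False
    rollout_at = policy.get("rollout_at")
    if rollout_at is None:
        return True
    # Only start if the script is expected to finish before the rollout.
    return now + estimate_seconds <= rollout_at
```

With a rollout 1000 seconds away, a 60 second script would still start while a 20 minute script would be skipped, which matches the optional behaviour described in the second bullet.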

However, that's only a preconceived idea. The actual constraints are:
- be better than the current rollout process of mass crontab editing
- put the policy somewhere more central (e.g. db, config file, whatever)
- for specific highly sensitive cronscripts we may want some 'is it safe yet' check, but that shouldn't be conflated with how we coordinate whether things run or not, unless it makes sense.
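
To illustrate keeping the 'is it safe yet' check separate from the run/don't-run policy, here is a rough sketch from the deployer's side. It assumes, purely for illustration, that each sensitive cronscript creates a `<name>.lock` file in a shared directory while it runs; the directory layout and script names are hypothetical.

```python
import os


def scripts_blocking_deploy(lock_dir, sensitive_scripts):
    """Return the sensitive scripts that appear to be running right now.

    Assumes (hypothetically) that a running script holds a file named
    <script>.lock inside lock_dir.
    """
    blocking = []
    for name in sensitive_scripts:
        lock_path = os.path.join(lock_dir, name + ".lock")
        if os.path.exists(lock_path):
            blocking.append(name)
    return blocking
```

A rollout tool could poll this and proceed only once the returned list is empty, independently of whatever mechanism decides whether new cronscript runs may start.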

The current process causes extended downtime windows by not running anything leading up to the rollout. Part of the issue is that individual scripts can't tolerate other services (e.g. the xmlrpc server) going down. We may need a bunch of fine-grained bugs to make them better, but the robustness work here should at least let us start closing the gap.

I'm marking this as high because it will be hard for us to change our merge-qa-deploy workflow until we reduce the downtime of production non-db-patch rollouts, and I know everyone is keen to change that workflow :).

Related branches

lp:~stub/launchpad/cronscripts

Merged into lp:launchpad at revision 11637

Robert Collins (community): Approve on 2010-09-23

Canonical Launchpad Engineering: Pending requested 2010-09-23

Gary Poster (gary) wrote on 2010-07-19:

Addressing 605822 at the same time, or at least preparing for it, would be nice.

Changed in launchpad-foundations:
status:	New → Triaged

Gary Poster (gary) on 2010-07-19

Changed in launchpad-foundations:
assignee:	nobody → Stuart Bishop (stub)

Stuart Bishop (stub) on 2010-08-06

tags:

added: cron

Launchpad QA Bot (lpqabot) wrote on 2010-09-18: Bug fixed by a commit

Fixed in stable r11567 <http://bazaar.launchpad.net/~launchpad-pqm/launchpad/stable/revision/11567>.

Changed in launchpad-foundations:
milestone:	none → 10.10
tags:	added: qa-needstesting
Changed in launchpad-foundations:
status:	Triaged → Fix Committed

Launchpad QA Bot (lpqabot) wrote on 2010-09-28:

Fixed in stable r11637 <http://bazaar.launchpad.net/~launchpad-pqm/launchpad/stable/revision/11637>.

Stuart Bishop (stub) on 2010-09-29

tags:

added: qa-ok
removed: qa-needstesting

Curtis Hovey (sinzui) on 2010-10-14

Changed in launchpad-foundations:
status:	Fix Committed → Fix Released
