Build farm grinds to a halt with inconsistent build

Bug #499421 reported by Michael Nelson
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Launchpad itself
Fix Released
High
Michael Nelson

Bug Description

We are occasionally seeing the situation where a job has a date_started but the associated build queue entry does not have a builder associated

In this situation the build farm grinds to a halt (eg. https://answers.edge.launchpad.net/soyuz/+question/94802).

As a temporary work-around, to determine whether it is the cause of a build farm situation the following query can be run: https://pastebin.canonical.com/25913/

And to set the (date_started, status) back to (null, WAITING): https://pastebin.canonical.com/26153/

Now that we've fixed the issue of the build-manager falling over in this situation, bug 527617 will track the remaining issue of the occasional inconsistent builds.

Tags: lp-soyuz

Related branches

Revision history for this message
Muharem Hrnjadovic (al-maisan) wrote :

At least BuildQueue.getActiveBuildJobs() needs to be revised to check for the presence of the builder.

Changed in soyuz:
milestone: none → 10.01
assignee: nobody → Michael Nelson (michael.nelson)
Revision history for this message
Michael Nelson (michael.nelson) wrote :

As far as we know, the inconsistent state is a result of manual SQL updates for builds.

IBuildQueue.markAsBuilding() sets the builder, IBuildQueue.reset() and updateBuild_ABORTED() both set it back to None and reset the date_started at the same time.

IBuildQueue.setDateStarted() seems to be only ever used to reset back to None. There are a few instances in lp.buildmaster.buildergroup.py which set builder to None, but it is always done at the same time as setting the date_started to None.

Tom has got instructions for LOSA's for how to fix the situation when it occurs, and has a branch ready for review that will allow a nagios check to alert automatically (I'll link it). I'm waiting to hear whether we've seen this occur since the last known occurrence on 2009-12-21.

Revision history for this message
Tom Haddon (mthaddon) wrote :

[2010-01-01 05:25:51] SERVICE ALERT: wildcherry;Build Job No Builder;CRITICAL;HARD;3;FAIL: there is a job that thinks it has started but has not got a builder assigned
[2010-01-01 05:24:51] SERVICE ALERT: wildcherry;Build Job No Builder;CRITICAL;SOFT;2;FAIL: there is a job that thinks it has started but has not got a builder assigned
[2010-01-01 05:23:51] SERVICE ALERT: wildcherry;Build Job No Builder;CRITICAL;SOFT;1;FAIL: there is a job that thinks it has started but has not got a builder assigned

Revision history for this message
Michael Nelson (michael.nelson) wrote :

Thanks Tom for the info.

It turns out that the fix that we're currently applying when this happens is the cause of a separate issue that we're seeing (bug 499095), as it resets the job's date_started, but leaves the job.status as RUNNING. When the related build is redispatched, our soyuz code does not set the job's date_started again as it (strangely) checks the job.status rather than whether the build already has a builder.

Anyway, I've updated the query that you're using to fix this current bug so that it also resets the job status to WAITING. Can you please update the LOSA docs with the new link:

https://pastebin.canonical.com/26153/

description: updated
Revision history for this message
Michael Nelson (michael.nelson) wrote :

See bug 503440 - it seems that one build that caused the build farm to grind to a halt did so after many non-eventful dispatches, which suggests that this bug itself wasn't triggered by corrupted data.

Revision history for this message
Michael Nelson (michael.nelson) wrote :

I'm marking this as fixed released, as we've done a number SQL data fixes, as well as the fix for bug 499095, and have not seen the buildfarm grinding to a halt since the 2010-01-01 event above.

If we do see this again with a build that has *not* been manually updated, we need to log all the relevant info for further investigation.

Changed in soyuz:
status: Triaged → Fix Released
Revision history for this message
Steve McInerney (spm) wrote :

This just triggered:
have run the above update, after doing a select to grab details:
https://pastebin.canonical.com/27577/

Revision history for this message
Michael Nelson (michael.nelson) wrote :

Initial investigation:
It's an amd64 build of thunderbird-3.0 3.0.2~hg20100209r4691+nobinonly-0ubuntu1~umd1~hardy in ubuntu hardy RELEASE
The build is oopsing as expected:
https://edge.launchpad.net/~ubuntu-mozilla-daily/+archive/ppa/+build/1492525
as the IBuild.buildstate is still BUILDING but the the related
BuildQueue.builder attribute is null (the actual bug).
The build doesn't have a buildlog.

The buildd logs are interesting - in that they don't exist?? Currently
buildd-manager.log.2 goes from 2010-02-07 18:04 - 2010-02-07 21:04
buildd-manager.log.1 goes from 2010-02-07 21:04 - 2010-02-07 23:47
buildd-manager.log goes from 2010-02-09 07:04 - present.
Hmm... did I miss something? If not, then perhaps a losa can let us know why
they stopped?

Other builds of the same package for different arches built fine in hardy,
and other builds of the same package on amd64 for different distroseries
built fine as well:
https://edge.launchpad.net/~ubuntu-mozilla-daily/+archive/ppa/+builds?build_text=thunderbird-3.0&build_state=all

Revision history for this message
Michael Nelson (michael.nelson) wrote :

This happened again today with the build 1524023.

We could avoid this issue blocking the buildfarm completely (although it would still block a builder), by ensuring that IBuildQueueSet.getActiveBuildJobs() does not return build queue items without a builder. I'll get a branch ready.

Revision history for this message
Michael Nelson (michael.nelson) wrote :

Twice in two days: build 1390987.

@bigjools: should we consider CPing the above fix that will ensure the bm doesn't get stuck in these situations?

We'll also try to get the relevant SQL logs to see if they tell us where the problem in the code lies.

Revision history for this message
Michael Nelson (michael.nelson) wrote :

Hi Tom, here's some local sql debug output showing what's executed when a build is dispatched:

http://pastebin.ubuntu.com/382820/

So we should be able to grep the logs for:

UPDATE Build SET buildstate=6 WHERE Build.id = 1390987

to find the related nearby statements that *should* have also set the builder (as in the paste).

Revision history for this message
Tom Haddon (mthaddon) wrote :

We don't keep logs of all statements, only long running ones. That query doesn't show up in the logs.

Revision history for this message
Julian Edwards (julian-edwards) wrote : Re: [Bug 499421] Re: Build farm grinds to a halt with inconsistent build

On Wednesday 24 February 2010 07:57:35 Michael Nelson wrote:
> @bigjools: should we consider CPing the above fix that will ensure the
> bm doesn't get stuck in these situations?
>

Yes please.

Changed in soyuz:
status: Fix Released → Triaged
milestone: 10.01 → 10.03
Revision history for this message
Michael Nelson (michael.nelson) wrote :

Now that we have a fix for the build-manager falling over in this situation, I've created bug 527617 to track the on-going issue and linked it back here for info.

Changed in soyuz:
milestone: 10.03 → 10.02
status: Triaged → Fix Committed
description: updated
Changed in soyuz:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.