Buildd-manager should deal with transient communication failures with builders

Bug #369109 reported by Celso Providelo
This bug affects 2 people
Affects: Launchpad itself
Status: Fix Released
Importance: High
Assigned to: Julian Edwards

Bug Description

As described in bug #343683 and bug #31546, builders are subject to transient communication failures and shouldn't be immediately marked as NOT_OK when one happens.

Instead, we should allow an acceptable period of unavailability before excluding the builder from the poll. That way, builders that fall victim to network hiccups but have continued with their jobs won't mistakenly be reset.

On the flip side, builders excluded from the poll should be re-probed periodically, so that they are automatically made available again once whatever was preventing them from building has been fixed.
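
The behaviour requested above could be sketched roughly as follows. This is only an illustration, not the buildd-manager code: the BuilderHealth class, its field names, and the 300 and 600 second defaults are all assumptions made for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BuilderHealth:
    """Tracks probe results for one builder (hypothetical helper)."""

    grace_period: float = 300.0      # assumed: tolerated seconds of unavailability
    reprobe_interval: float = 600.0  # assumed: seconds between probes once excluded
    clock: Callable[[], float] = time.monotonic
    first_failure: Optional[float] = None
    last_probe: Optional[float] = None
    excluded: bool = False

    def should_probe(self) -> bool:
        """Probe healthy builders every cycle, excluded ones only periodically."""
        if not self.excluded or self.last_probe is None:
            return True
        return self.clock() - self.last_probe >= self.reprobe_interval

    def record(self, ok: bool) -> None:
        """Record one probe result and update the exclusion state."""
        now = self.clock()
        self.last_probe = now
        if ok:
            # Any successful probe makes the builder available again.
            self.first_failure = None
            self.excluded = False
            return
        if self.first_failure is None:
            self.first_failure = now
        # Exclude only once failures have persisted for the whole grace period.
        if now - self.first_failure >= self.grace_period:
            self.excluded = True
```

A scanning cycle would call should_probe() before contacting a builder and record() with the outcome, resetting the job only once excluded becomes true.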

tags: added: buildd-manager
Julian Edwards (julian-edwards) wrote :

Today we had this in the log:

2010-04-15 23:53:23+0100 [-] Starting scanning cycle.
2010-04-15 23:56:46+0100 [-] Disabling builder: http://gourd.buildd:8221/ -- timed out
2010-04-15 23:56:46+0100 [-] Traceback (most recent call last):
2010-04-15 23:56:46+0100 [-] File "/srv/launchpad.net/codelines/soyuz-production-rev-9191/lib/lp/buildmaster/model/builder.py", line 205, in updateBuilderStatus
2010-04-15 23:56:46+0100 [-] builder.checkSlaveAlive()
2010-04-15 23:56:46+0100 [-] File "/srv/launchpad.net/codelines/soyuz-production-rev-9191/lib/lp/buildmaster/model/builder.py", line 320, in checkSlaveAlive
2010-04-15 23:56:46+0100 [-] if self.slave.echo("Test")[0] != "Test":
2010-04-15 23:56:46+0100 [-] File "/usr/lib/python2.5/xmlrpclib.py", line 1147, in __call__
2010-04-15 23:56:46+0100 [-] return self.__send(self.__name, args)
2010-04-15 23:56:46+0100 [-] File "/usr/lib/python2.5/xmlrpclib.py", line 1437, in __request
2010-04-15 23:56:46+0100 [-] verbose=self.__verbose
2010-04-15 23:56:46+0100 [-] File "/usr/lib/python2.5/xmlrpclib.py", line 1185, in request
2010-04-15 23:56:46+0100 [-] errcode, errmsg, headers = h.getreply()
2010-04-15 23:56:46+0100 [-] File "/usr/lib/python2.5/httplib.py", line 1199, in getreply
2010-04-15 23:56:46+0100 [-] response = self._conn.getresponse()
2010-04-15 23:56:46+0100 [-] File "/usr/lib/python2.5/httplib.py", line 928, in getresponse
2010-04-15 23:56:46+0100 [-] response.begin()
2010-04-15 23:56:46+0100 [-] File "/usr/lib/python2.5/httplib.py", line 385, in begin
2010-04-15 23:56:46+0100 [-] version, status, reason = self._read_status()
2010-04-15 23:56:46+0100 [-] File "/usr/lib/python2.5/httplib.py", line 343, in _read_status
2010-04-15 23:56:46+0100 [-] line = self.fp.readline()
2010-04-15 23:56:46+0100 [-] File "/usr/lib/python2.5/socket.py", line 331, in readline
2010-04-15 23:56:46+0100 [-] data = recv(1)
2010-04-15 23:56:46+0100 [-] timeout: timed out

It seems as though there are two competing ways of timing stuff out.
1. the code in lib/lp/buildmaster/manager.py (QueryWithTimeoutProtocol)
2. lib/lp/buildmaster/model/builder.py (TimeoutTransport)

Different actions seem to cause timeouts in each of these. This is crap.

It also seems that updateBuilderStatus() should catch the timeout exception shown above. When it doesn't, it produces that traceback and leaves the builder disabled with the build still on it.
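
As a rough sketch of what catching the timeout might look like (not the actual fix): probe_builder below is a hypothetical wrapper, and its echo argument is any callable standing in for self.slave.echo. It is written for Python 3, where socket.timeout is a subclass of OSError, rather than the Python 2.5 seen in the traceback.

```python
import xmlrpc.client


def probe_builder(echo, token="Test"):
    """Return (alive, reason) instead of letting a timeout propagate.

    `echo` stands in for the slave's echo call; this wrapper is hypothetical.
    """
    try:
        reply = echo(token)
    except (OSError, xmlrpc.client.Error) as exc:
        # OSError covers socket timeouts and refused or reset connections.
        return False, f"communication failure: {exc}"
    if not reply or reply[0] != token:
        return False, f"unexpected echo reply: {reply!r}"
    return True, "ok"
```

The caller can then decide whether a failed probe counts as transient, rather than ending up with a disabled builder that still has a build on it.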

Curtis Hovey (sinzui)
Changed in soyuz:
assignee: Celso Providelo (cprov) → nobody
Changed in soyuz:
status: Triaged → In Progress
assignee: nobody → Julian Edwards (julian-edwards)
milestone: pending → none
Launchpad QA Bot (lpqabot) wrote : Bug fixed by a commit
Changed in soyuz:
milestone: none → 10.09
tags: added: qa-needstesting
Changed in soyuz:
status: In Progress → Fix Committed
Julian Edwards (julian-edwards) wrote :

I think this is fixed, although it's very hard to test. Please re-open this bug if it's not.

tags: added: qa-untestable
removed: qa-needstesting
Curtis Hovey (sinzui)
Changed in soyuz:
status: Fix Committed → Fix Released