Reconnection code doesn't work... again

Bug #427990 reported by Francis J. Lacoste
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Launchpad itself
Invalid
High
Unassigned

Bug Description

The 2009-09-10 disk space incident (https://wiki.canonical.com/IncidentReports/2009-09-10-Wildcherry-Diskspace) seem to have caused the app servers to lose their DB connection. Once the incident was over, the app servers didn't reconnect to the DB until they were restarted.

Here is a traceback from the app server log:

2009-09-10T20:30:07 ERROR SiteError https://translations.launchpad.net
Traceback (most recent call last):
  File "/srv/launchpad.net/production/launchpad-rev-8337/lib/zope/publisher/publish.py", line 127
, in publish
    publication.beforeTraversal(request)
  File "/srv/launchpad.net/production/launchpad-rev-8337/lib/canonical/launchpad/webapp/publicati
on.py", line 180, in beforeTraversal
    principal = self.getPrincipal(request)
  File "/srv/launchpad.net/production/launchpad-rev-8337/lib/canonical/launchpad/webapp/publicati
on.py", line 208, in getPrincipal
    principal = auth_utility.authenticate(request)
  File "/srv/launchpad.net/production/launchpad-rev-8337/lib/canonical/launchpad/webapp/authentic
ation.py", line 125, in authenticate
    return self._authenticateUsingCookieAuth(request)
  File "/srv/launchpad.net/production/launchpad-rev-8337/lib/canonical/launchpad/webapp/authentication.py", line 88, in _authenticateUsingCookieAuth
    principal = login_src.getPrincipal(id)
  File "/srv/launchpad.net/production/launchpad-rev-8337/lib/canonical/launchpad/webapp/authentication.py", line 226, in getPrincipal
    return self._principalForAccount(account, access_level, scope)
  File "/srv/launchpad.net/production/launchpad-rev-8337/lib/canonical/launchpad/webapp/authentication.py", line 284, in _principalForAccount
    access_level=access_level, scope=scope)
  File "/srv/launchpad.net/production/launchpad-rev-8337/lib/canonical/launchpad/webapp/authentication.py", line 307, in __init__
    self.person = IPerson(account, None)
  File "/srv/launchpad.net/production/launchpad-rev-8337/lib/zope/app/component/hooks.py", line 95, in adapter_hook
    return siteinfo.adapter_hook(interface, object, name, default)
  File "/srv/launchpad.net/production/launchpad-rev-8337/lib/zope/security/adapter.py", line 84, in __call__
    adapter = self.factory(*args)
  File "/srv/launchpad.net/production/launchpad-rev-8337/lib/lp/registry/model/person.py", line 3835, in person_from_account
    person = IStore(Person).find(Person, account=account).one()
  File "/srv/launchpad.net/production/launchpad-rev-8337/eggs/storm-0.14trunk_321-py2.4-linux-x86:
_64.egg/storm/store.py", line 1114, in one
    result = self._store._connection.execute(select)
  File "/srv/launchpad.net/production/launchpad-rev-8337/eggs/storm-0.14trunk_321-py2.4-linux-x86_64.egg/storm/databases/postgres.py", line 249, in execute
    return Connection.execute(self, statement, params, noresult)
  File "/srv/launchpad.net/production/launchpad-rev-8337/eggs/storm-0.14trunk_321-py2.4-linux-x86_64.egg/storm/database.py", line 207, in execute
    self._ensure_connected()
  File "/srv/launchpad.net/production/launchpad-rev-8337/eggs/storm-0.14trunk_321-py2.4-linux-x86_64.egg/storm/database.py", line 320, in _ensure_connected
    raise DisconnectionError("Already disconnected")
DisconnectionError: Already disconnected

Revision history for this message
Francis J. Lacoste (flacoste) wrote :

MIght be the same thing than bug #420804, although in this case the DB was restarted. But since that bug doesn't have any exception message, we cannot yet determined if it's the same code path.

Revision history for this message
Francis J. Lacoste (flacoste) wrote :

The connections were reset by the database because a backend process crashed:

2009-09-10 20:23:19 BST WARNING: terminating connection because of crash of another server process
2009-09-10 20:23:19 BST DETAIL: The postmaster has commanded this server process to roll back the current transacti
on and exit, because another server process exited abnormally and possibly corrupted shared memory.
2009-09-10 20:23:19 BST HINT: In a moment you should be able to reconnect to the database and repeat your command.

Gary Poster (gary)
Changed in launchpad-foundations:
assignee: nobody → Stuart Bishop (stub)
Revision history for this message
Francis J. Lacoste (flacoste) wrote :

See the thousands of DisconnectionError in https://devpad.canonical.com/~matsubara/oops-summaries/lpnet-2009-09-10.html like OOPS-1349XMLP108

tags: added: oops
Revision history for this message
Stuart Bishop (stub) wrote :

The first reference I can find is this from the PostgreSQL master's log file:

<@:24515> 2009-09-10 20:23:19 BST LOG: server process (PID 19692) was terminate
d by signal 9: Killed

Looks like somebody kill -9'd a backend process, which is bad - when the backend process terminates without cleanup, PostgreSQL has to restart everything because shared resources are in an unknown state. Its an instant outage button basically.

The OOPS reports are not useful to us, as we now log an OOPS for a disconnection error before we reconnect and retry the transaction. Seeing thousands of OOPSes like this when the database is in recovery mode is normal. We may want to reconsider this decision.

The appservers should reconnect once the database is back on its feet. We have tests for this behavior even, so I'm not sure what is going wrong there.

Revision history for this message
Stuart Bishop (stub) wrote :

I can't reproduce the reconnection failure locally. I suspect it is broken on the production branch, but working on launchpad/devel and db-devel.

Did the edge servers reconnect correctly? If they did, it confirms that this is already fixed on launchpad/devel (probably with the Storm 0.15 work done earlier).

Can we reproduce this state on staging? If we can't, the issue is likely fixed on launchpad/db-devel.

Changed in launchpad-foundations:
status: Triaged → Incomplete
Revision history for this message
Stuart Bishop (stub) wrote :

An attempt to reproduce the reconnection failure on staging failed.

Revision history for this message
Tom Haddon (mthaddon) wrote : Re: [Bug 427990] Re: Reconnection code doesn't work... again

On Tue, 2009-09-15 at 08:14 +0000, Stuart Bishop wrote:
> I can't reproduce the reconnection failure locally. I suspect it is
> broken on the production branch, but working on launchpad/devel and db-
> devel.
>
> Did the edge servers reconnect correctly? If they did, it confirms that
> this is already fixed on launchpad/devel (probably with the Storm 0.15
> work done earlier).

The edge servers had to be restarted as well, so I don't think that's
it.

>
> Can we reproduce this state on staging? If we can't, the issue is likely
> fixed on launchpad/db-devel.
>
> ** Changed in: launchpad-foundations
> Status: Triaged => Incomplete
>

Gary Poster (gary)
Changed in launchpad-foundations:
status: Incomplete → Triaged
Revision history for this message
Stuart Bishop (stub) wrote :

There is still no way to progress as we can't reproduce this on staging. I'm still hopeful that this has been fixed on launchpad/db-devel and this is the reason we can't reproduce it.

Changed in launchpad-foundations:
status: Triaged → Incomplete
Curtis Hovey (sinzui)
Changed in launchpad-foundations:
milestone: 3.0 → 3.1.11
Stuart Bishop (stub)
Changed in launchpad-foundations:
milestone: 3.1.11 → none
assignee: Stuart Bishop (stub) → nobody
Revision history for this message
Robert Collins (lifeless) wrote :

Closing it as its just noise currently; obviously if it happens again we can dig into it more and try again to reproduce.

Changed in launchpad-foundations:
status: Incomplete → Invalid
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.