appserver isn't recovering like it should causing too many oopses

Bug #360846 reported by Ursula Junque
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Launchpad itself
Fix Released
Critical
Francis J. Lacoste
Storm
Fix Released
Undecided
Francis J. Lacoste

Bug Description

It causes the following message, as seen in OOPS-1194D1005:
  InterfaceError: connection already closed

More: OOPS-1194D1001, OOPS-1194D1002, OOPS-1194D1003, OOPS-1194D1117, OOPS-1194D1132, OOPS-1194D1134.
Another bunch in 04/23: OOPS-1209A1332, OOPS-1209A1336, OOPS-1209A1341, OOPS-1209A1337, OOPS-1209A1339, OOPS-1209A1342.

The root cause is that storm doesn't detect that InterfaceError: connection already closed is a form of disconnection failure

Related branches

Ursula Junque (ursinha)
Changed in launchpad-foundations:
assignee: nobody → Stuart Bishop (stub)
Ursula Junque (ursinha)
description: updated
tags: added: oops
Revision history for this message
Francis J. Lacoste (flacoste) wrote :

This happened twice this week. Once on Monday after adding two new appservers for sso. And once today, cause still unknown.

Changed in launchpad-foundations:
importance: Undecided → High
milestone: none → 2.2.5
status: New → Triaged
Revision history for this message
Ursula Junque (ursinha) wrote :

We had another incident causing another bunch of oopses today, 05/05, and elmo restarted lpnet again. flacoste said stub will work on a fix for this bug this week (not today - holiday for him).

Changed in launchpad-foundations:
importance: High → Critical
Revision history for this message
Tom Haddon (mthaddon) wrote :

Causing outages on lpnet

Changed in launchpad-foundations:
assignee: Stuart Bishop (stub) → Francis J. Lacoste (flacoste)
description: updated
Changed in launchpad-foundations:
status: Triaged → In Progress
Revision history for this message
Francis J. Lacoste (flacoste) wrote :

Merged on rev 122 of our storm branch.

Changed in launchpad-foundations:
status: In Progress → Fix Committed
Revision history for this message
Francis J. Lacoste (flacoste) wrote :

The fix isn't working.

We do have a better understanding of what is happening. https://pastebin.canonical.com/17341/ shows a bunch of InterfaceError OOPS that happens. We see that one stretch of OOPS begins at OOPS-I2432. If we look at the previous OOPS-I2431, we see that an OperationalError was raised during the call to set_statement_timeout. Because that method doesn't use the _check_disconnect wrapper, the error is uncaught and leaves the connection in a disconnected state.

The following request gets an InterfaceError because of this. The suggested fix isn't enough because is_disconnection_error() is only called when the error is a subclass of DatabaseError. This isn't the case for InterfaceError.

We have no idea why the initial OperationError is happening. What is weird is that the client reports that the server disconnects it, but on the server, we see that it closes the client socket two minutes after the error. See https://pastebin.canonical.com/17342/ (to compare the timestamps).

Changed in launchpad-foundations:
status: Fix Committed → In Progress
Revision history for this message
Diogo Matsubara (matsubara) wrote :

I think Francis meant OOPS-1223I2432 and OOPS-1223I2431 respectively.

Revision history for this message
Diogo Matsubara (matsubara) wrote :

In https://devpad.canonical.com/~matsubara/bug-360846-oops-summary.html you can find a oops summary for the lpnet9 (oops_prefix: I) instance only, between 2009-05-07 21:33:00UTC and 2009-05-07 21:39:00UTC

The 5 OperationalError have one thing in common, they're all nagios checks.

Revision history for this message
Francis J. Lacoste (flacoste) wrote :

In https://devpad.canonical.com/~matsubara/2009-05-07-tmp-summary.html, we see a similar pattern. This OOPS report spans the 50 minutes after the restart of lpnet9 before InterfaceError starts creeping up.

Revision history for this message
Fernando Miguel (fernandomiguel) wrote :

i would just to add that even OpenID is failing too (maybe timeouts?), making it hard to even login to the wiki pages (and other services users use OIP with)

Revision history for this message
Diogo Matsubara (matsubara) wrote :

I correlated the information from app server launchpad9.log (lpnet9 -- oops_prefix: I) restart times (restart times in BST) with oops summaries (oops summaries times in UTC). All of these summaries are from the time the app server was restarted until the first OE/IE error is raised.

2009-05-07T04:37:48 INFO Server zope.server.http (HTTP) started.
https://devpad.canonical.com/~matsubara/bug-360846-summaries/2009-05-07-03:37:48--06:05:00.html
2009-05-07T09:40:49 INFO Server zope.server.http (HTTP) started.
https://devpad.canonical.com/~matsubara/bug-360846-summaries/2009-05-07-08:40:49--16:56:00.html
2009-05-07T20:40:52 INFO Server zope.server.http (HTTP) started.
https://devpad.canonical.com/~matsubara/bug-360846-summaries/2009-05-07-19:40:52--21:28:00.html
2009-05-07T22:38:11 INFO Server zope.server.http (HTTP) started.
https://devpad.canonical.com/~matsubara/bug-360846-summaries/2009-05-07-21:28:11--23:59:59.html

Revision history for this message
James Henstridge (jamesh) wrote :

Fix merged to Storm's trunk in r305

Changed in storm:
assignee: nobody → Francis J. Lacoste (flacoste)
milestone: none → 0.15
status: New → Fix Committed
Revision history for this message
Francis J. Lacoste (flacoste) wrote :

The appropriate fix was cherry-picked on Friday evening.

Changed in launchpad-foundations:
status: In Progress → Fix Released
Revision history for this message
Francis J. Lacoste (flacoste) wrote :

Automatic recovery of further problem is described in bug 374909.

Jamu Kakar (jkakar)
Changed in storm:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.