App servers dying and leaving stale pidfile

Bug #247227 reported by Tom Haddon
16
Affects Status Importance Assigned to Milestone
Launchpad itself
Fix Released
High
Unassigned

Bug Description

Since the move to storm, we've noticed that production app servers are occasionally dying and leaving a stale pidfile. Below are some relevant entries from our incident log:

 2008-07-09 10:43 UTC - lpnet4 had died and left stale pidfile - restarted
 2008-07-07 08:57 UTC - lpnet7 had died and left stale pidfile - restarted
 2008-07-07 04:57 UTC - lpnet8 had died and left stale pidfile - restarted
 2008-07-05 06:13 UTC - lpnet2 had died and left stale pidfile - restarted
 2008-07-02 17:00 UTC - lpnet1 had died and left pidfile hanging around - restarted

This now appears to be also affecting staging (last two days, staging restore has failed to start the staging app server because of a stale pidfile from the app server dying), and staging is restarted every day as part of the upgrade.

There doesn't yet appear to be anything obvious in the logs or nohup files that we can see suggesting why the app servers are dying, but I've attached some access logs and app server logs from some died app servers.

Revision history for this message
Tom Haddon (mthaddon) wrote :
Changed in launchpad:
importance: Undecided → High
Revision history for this message
Curtis Hovey (sinzui) wrote :

I have observed this when I run make schema. The running instance of launchpad dies silently.

Revision history for this message
Joey Stanford (joey) wrote :

definitely confirmed. Staging and lpnet both die

Changed in launchpad:
status: New → Confirmed
Revision history for this message
Christian Reis (kiko) wrote :

I don't understand Curtis' comment. When running make schema with an already running Launchpad? But surely that's not what's affecting lpnet or edge?

Revision history for this message
Stuart Bishop (stub) wrote :

make schema will kill the database connection without notifying the client. It is exactly the same as what will happen if the connection is killed by the stale db connection reaper, and I've seen a few hits go past for the lpnet connections.

So I'm led to suspect that Launchpad is dying silently if the database server connection is killed (and possibly if there is a network drop out, which would be similar). It is supposed to be reconnecting if this happens.

We can test locally using "SELECT usename, procpid from pg_stat_activity" to get the process ids for the connections, and just use unix 'kill' to terminate the process.

Revision history for this message
Herb McNew (herb) wrote :

This continues to be a problem.

2008-07-28 14:04 UTC - lpnet1 had died and left stale pidfile - restarted
2008-07-27 21:40 UTC - staging had died and left stale pidfile - restarted
2008-07-25 20:05 UTC - lpnet2 had died and left stale pidfile - restarted
2008-07-25 18:58 UTC - lpnet2 had died and left stale pidfile - restarted
2008-07-25 11:28 UTC - lpnet2 had died and left stale pidfile - restarted
2008-07-25 09:15 UTC - lpnet4 had died and left stale pidfile - restarted
2008-07-24 01:40 UTC - staging had died and left stale pidfile - restarted
2008-07-22 18:38 UTC - lpnet3 had died and left stale pidfile - restarted
2008-07-18 18:55 UTC - lpnet1 had died and left stale pidfile - restarted
2008-07-14 17:38 UTC - lpnet7 had died and left stale pidfile - restarted
2008-07-11 16:11 UTC - lpnet3 had died and left stale pidfile - restarted
2008-07-10 08:40 UTC - lpnet6 had died and left stale pidfile - restarted

I have collected log files for a handful of these incidents listed above and can put them in a shared location for someone to look at.

I also tested Stuart's hypothesis above. Despite killing all lpnet1's connections to the DB, the app server continued running and happily reconnected.

Revision history for this message
Christian Reis (kiko) wrote :

Herb and I are working on running staging inside GDB so we can see what's going on when it dies.

Revision history for this message
Herb McNew (herb) wrote :

Over the last week or so we've seen problems with staging on the following occassions:

2008-08-13 09:45 UTC - Staging had died and left stale pidfile - restarted
2008-08-12 22:38 UTC - Staging had died and left stale pidfile - restarted
2008-08-12 22:13 UTC - Staging had died and left stale pidfile - restarted
2008-08-12 08:20 UTC - Staging had died and left stale pidfile - restarted
2008-08-11 21:50 UTC - Staging had died and left stale pidfile - restarted
2008-08-10 22:20 UTC - Staging had died and left stale pidfile - restarted
2008-08-09 22:25 UTC - Staging had died and left stale pidfile - restarted
2008-08-08 00:25 UTC - Staging had died and left stale pidfile - restarted

You can find the nohup output which contains the GDB output for staging in my home directory on devpad. Note: the dates on the nohup output files are in BST which is UTC +1 right now and they are created when the process is restarted.

Revision history for this message
Christian Reis (kiko) wrote :

Cool, so the basics work. Herb, can you issue a "bt" so we get a backtrace before you run the quit? That way we hopefully will be able to see exactly where the crash is occurring.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Looking at the stack trace in https://devpad.canonical.com/~spm/staging-nohup.out-20080825230741.gz, I'd seriously suggest updating to that new version of psycopg2 we have packaged somewhere.

It seems to be crashing while one thread is in python's gc cycle collection while another thread is in _psycopg, and this suggests a refcounting bug in psycopg to me.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Oh, another thing that might make chasing this down easier is to have debug symbols for the _psycopg extension module around somehow.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Or it could be https://bugs.edge.launchpad.net/zope3/+bug/181833 (as apparently found by our local friendly Landscape team!)

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

After thinking a bit more, I've gone back to thinking this is a bug in psycopg2 and not in the proxy code.

Revision history for this message
Stuart Bishop (stub) wrote :

No longer seems to be a problem (?)

Changed in launchpad-foundations:
status: Triaged → Incomplete
Tom Haddon (mthaddon)
tags: added: canonical-losa-lp
Revision history for this message
Robert Collins (lifeless) wrote :

@Tom have you seen this -exact- case since?

Revision history for this message
Tom Haddon (mthaddon) wrote :

Haven't seen this for a while as far as I can remember.

Gary Poster (gary)
Changed in launchpad-foundations:
status: Incomplete → Fix Released
Curtis Hovey (sinzui)
visibility: private → public
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.