Launchpad itself

App servers dying and leaving stale pidfile

Bug #247227 reported by Tom Haddon on 2008-07-10

Affects		Status	Importance	Assigned to	Milestone
	Launchpad itself	Fix Released	High	Unassigned

Bug Description

Since the move to storm, we've noticed that production app servers are occasionally dying and leaving a stale pidfile. Below are some relevant entries from our incident log:

2008-07-09 10:43 UTC - lpnet4 had died and left stale pidfile - restarted
2008-07-07 08:57 UTC - lpnet7 had died and left stale pidfile - restarted
2008-07-07 04:57 UTC - lpnet8 had died and left stale pidfile - restarted
2008-07-05 06:13 UTC - lpnet2 had died and left stale pidfile - restarted
2008-07-02 17:00 UTC - lpnet1 had died and left pidfile hanging around - restarted

This now appears to be also affecting staging (last two days, staging restore has failed to start the staging app server because of a stale pidfile from the app server dying), and staging is restarted every day as part of the upgrade.

There doesn't yet appear to be anything obvious in the logs or nohup files that we can see suggesting why the app servers are dying, but I've attached some access logs and app server logs from some died app servers.

Tags:

Revision history for this message

Tom Haddon (mthaddon) wrote on 2008-07-10:

app-server-logs.tar.bz2 Edit (29.5 KiB, application/octet-stream)

Changed in launchpad:
importance:	Undecided → High

Revision history for this message

Curtis Hovey (sinzui) wrote on 2008-07-10:

I have observed this when I run make schema. The running instance of launchpad dies silently.

Revision history for this message

Joey Stanford (joey) wrote on 2008-07-11:

definitely confirmed. Staging and lpnet both die

Changed in launchpad:
status:	New → Confirmed

Revision history for this message

Christian Reis (kiko) wrote on 2008-07-17:

I don't understand Curtis' comment. When running make schema with an already running Launchpad? But surely that's not what's affecting lpnet or edge?

Revision history for this message

Stuart Bishop (stub) wrote on 2008-07-19:

make schema will kill the database connection without notifying the client. It is exactly the same as what will happen if the connection is killed by the stale db connection reaper, and I've seen a few hits go past for the lpnet connections.

So I'm led to suspect that Launchpad is dying silently if the database server connection is killed (and possibly if there is a network drop out, which would be similar). It is supposed to be reconnecting if this happens.

We can test locally using "SELECT usename, procpid from pg_stat_activity" to get the process ids for the connections, and just use unix 'kill' to terminate the process.

Revision history for this message

Herb McNew (herb) wrote on 2008-07-28:

This continues to be a problem.

2008-07-28 14:04 UTC - lpnet1 had died and left stale pidfile - restarted
2008-07-27 21:40 UTC - staging had died and left stale pidfile - restarted
2008-07-25 20:05 UTC - lpnet2 had died and left stale pidfile - restarted
2008-07-25 18:58 UTC - lpnet2 had died and left stale pidfile - restarted
2008-07-25 11:28 UTC - lpnet2 had died and left stale pidfile - restarted
2008-07-25 09:15 UTC - lpnet4 had died and left stale pidfile - restarted
2008-07-24 01:40 UTC - staging had died and left stale pidfile - restarted
2008-07-22 18:38 UTC - lpnet3 had died and left stale pidfile - restarted
2008-07-18 18:55 UTC - lpnet1 had died and left stale pidfile - restarted
2008-07-14 17:38 UTC - lpnet7 had died and left stale pidfile - restarted
2008-07-11 16:11 UTC - lpnet3 had died and left stale pidfile - restarted
2008-07-10 08:40 UTC - lpnet6 had died and left stale pidfile - restarted

I have collected log files for a handful of these incidents listed above and can put them in a shared location for someone to look at.

I also tested Stuart's hypothesis above. Despite killing all lpnet1's connections to the DB, the app server continued running and happily reconnected.

Revision history for this message

Christian Reis (kiko) wrote on 2008-07-31:

Herb and I are working on running staging inside GDB so we can see what's going on when it dies.

Revision history for this message

Herb McNew (herb) wrote on 2008-08-13:

Over the last week or so we've seen problems with staging on the following occassions:

2008-08-13 09:45 UTC - Staging had died and left stale pidfile - restarted
2008-08-12 22:38 UTC - Staging had died and left stale pidfile - restarted
2008-08-12 22:13 UTC - Staging had died and left stale pidfile - restarted
2008-08-12 08:20 UTC - Staging had died and left stale pidfile - restarted
2008-08-11 21:50 UTC - Staging had died and left stale pidfile - restarted
2008-08-10 22:20 UTC - Staging had died and left stale pidfile - restarted
2008-08-09 22:25 UTC - Staging had died and left stale pidfile - restarted
2008-08-08 00:25 UTC - Staging had died and left stale pidfile - restarted

You can find the nohup output which contains the GDB output for staging in my home directory on devpad. Note: the dates on the nohup output files are in BST which is UTC +1 right now and they are created when the process is restarted.

Revision history for this message

Christian Reis (kiko) wrote on 2008-08-13:

#10

Cool, so the basics work. Herb, can you issue a "bt" so we get a backtrace before you run the quit? That way we hopefully will be able to see exactly where the crash is occurring.

Revision history for this message

Michael Hudson-Doyle (mwhudson) wrote on 2008-08-25:

#11

Looking at the stack trace in https://devpad.canonical.com/~spm/staging-nohup.out-20080825230741.gz, I'd seriously suggest updating to that new version of psycopg2 we have packaged somewhere.

It seems to be crashing while one thread is in python's gc cycle collection while another thread is in _psycopg, and this suggests a refcounting bug in psycopg to me.

Revision history for this message

Michael Hudson-Doyle (mwhudson) wrote on 2008-08-26:

#12

Oh, another thing that might make chasing this down easier is to have debug symbols for the _psycopg extension module around somehow.

Revision history for this message

Michael Hudson-Doyle (mwhudson) wrote on 2008-08-28:

#13

Or it could be https://bugs.edge.launchpad.net/zope3/+bug/181833 (as apparently found by our local friendly Landscape team!)

Revision history for this message

Michael Hudson-Doyle (mwhudson) wrote on 2008-08-28:

#14

After thinking a bit more, I've gone back to thinking this is a bug in psycopg2 and not in the proxy code.

Revision history for this message

Stuart Bishop (stub) wrote on 2009-04-06:

#15

No longer seems to be a problem (?)

Changed in launchpad-foundations:
status:	Triaged → Incomplete

Tom Haddon (mthaddon) on 2010-05-28

tags:

added: canonical-losa-lp

Revision history for this message

Robert Collins (lifeless) wrote on 2010-11-19:

#16

@Tom have you seen this -exact- case since?

Revision history for this message

Tom Haddon (mthaddon) wrote on 2010-11-19:

#17

Haven't seen this for a while as far as I can remember.

Gary Poster (gary) on 2010-11-23

Changed in launchpad-foundations:
status:	Incomplete → Fix Released

Curtis Hovey (sinzui) on 2012-05-10

visibility:

private → public

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Bug attachments

app-server-logs.tar.bz2 Edit

Add attachment

Remote bug watches

Bug watches keep track of this bug in other bug trackers.