Codehosting server used a lot of memory and not releasing "old" processes

Bug #260171 reported by Tom Haddon
Affects: Launchpad itself
Status: Fix Released
Importance: High
Assigned to: Michael Hudson-Doyle
Milestone: (none shown)

Bug Description

On 2008-08-20 17:00 UTC we had to restart the codehosting server because it was using excessive memory/swap and vostok was down to 13% swap free. There were 31 lp-serve processes using 32MB each, some from August 1st.

The logs should be available on devpad in the usual place:

/srv/launchpad.net-logs/scripts/vostok/

Thanks, Tom

Michael Hudson-Doyle (mwhudson) wrote :

31*32 is only a gig, which is a bit excessive but where was all the other memory going?

The issue of processes hanging around is semi-known, but I don't think it's understood. At least, *I* don't understand it.

Tom Haddon (mthaddon) wrote :

Rsync backups were taking up some memory, but I think 31*32MB including that many stale processes is enough to worry about on its own.

Jonathan Lange (jml) wrote :

I didn't know about the old processes issue. What do we know about it? Do we know what's causing it or whether it's reproducible locally?

Changed in launchpad-bazaar:
importance: Undecided → High
status: New → Triaged
Michael Hudson-Doyle (mwhudson) wrote :

Oh. It seems that 'bzr lp-serve' processes gradually accumulate on the server, long after it seems likely that a client is interested.

Poking around now, it seems that a client is still connected, though, at least according to netstat. I guess if any traffic was sent down the socket we'd get a connection reset by peer error or something.

Maybe we should have an inactivity timeout in the ssh server?

Jonathan Lange (jml) wrote : Re: [Bug 260171] Re: Codehosting server used a lot of memory and not releasing "old" processes

On Fri, Aug 29, 2008 at 7:51 AM, Michael Hudson
<email address hidden> wrote:
> Oh. It seems that 'bzr lp-serve' process gradually accumulate on the
> server, long after it seems likely that a client is interested.
>
> Poking around now, it seems that a client is still connected, though, at
> least according to netstat. I guess if any traffic was sent down the
> socket we'd get a connection reset by peer error or something.
>
> Maybe we should have an inactivity timeout in the ssh server?
>

A timeout is a great idea. I still want to know the root cause, though.

jml

Tom Haddon (mthaddon) wrote :

Happening again (28 stale processes this time)

Jonathan Lange (jml) wrote :

It'd be good to get pystack reports from gdb the next time this happens. At least we'll get a clue as to what the code thinks it's doing.

Tom Haddon (mthaddon) wrote :

Can we try to replicate this on staging? I'm not sure diving right into using gdb on the production server is the best way to go here.

Jonathan Lange (jml) wrote :

On Sun, Sep 14, 2008 at 6:37 AM, Tom Haddon <email address hidden> wrote:
> Can we try and replicate this on staging. I'm not sure diving right into
> using gdb on the production server is the best way to go here.
>

Sure we can try. It will simply take much longer to replicate the
problem, since we're still unaware of what triggers it.

jml

Michael Hudson-Doyle (mwhudson) wrote :

So all these lp-serve processes seem to correspond to an ESTABLISHED connection in netstat. It seems very likely that the connections are in fact dead, but as we're not trying to send any data to the client, no one notices.

Can you set SO_KEEPALIVE on the server end of a socket?
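
On the question above: keepalive is a per-socket option, so it can be enabled on the accepting (server) end of a connection as well as on the client end. A minimal sketch using Python's standard socket module follows. It is not the codehosting server's actual code; the helper name enable_keepalive and the timing values are assumptions made for illustration, and the TCP_KEEP* tuning constants are platform-specific, so they are guarded with hasattr.

```python
import socket


def enable_keepalive(sock, idle=600, interval=60, probes=5):
    """Turn on TCP keepalive probes for a TCP socket.

    Hypothetical helper for illustration; the timing values are assumptions.
    """
    # The basic switch: ask the kernel to probe an otherwise silent peer.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Optional per-socket tuning; skipped where the platform lacks it.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of silence before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between successive probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes allowed before the connection is declared dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock
```

On the server side this would be called on the socket returned by accept(). Once the probes go unanswered, the next read or write on the socket fails, which gives the serving process a chance to exit.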

Michael Hudson-Doyle (mwhudson) wrote :

I think I have a fix in progress for this. No idea how to test its effectiveness without cowboying it on to production for a few days though ...

Changed in launchpad-bazaar:
assignee: nobody → mwhudson
milestone: none → 2.1.10
status: Triaged → In Progress
Michael Hudson-Doyle (mwhudson) wrote :

So there's a chance that the server move fixed this by accident.

If not, I'd like to test my patch at some point...

Changed in launchpad-bazaar:
milestone: 2.1.10 → 2.1.11
Diogo Matsubara (matsubara) wrote :

Moving to .12

Changed in launchpad-bazaar:
milestone: 2.1.11 → 2.1.12
Paul Hummer (rockstar)
Changed in launchpad-bazaar:
milestone: 2.1.12 → 2.2.1
Changed in launchpad-bazaar:
milestone: 2.2.1 → 2.2.2
James Troup (elmo) wrote :

ping. Even with the codehosting box upgraded to 8GB of memory, this continues to be an issue...

Jonathan Lange (jml) wrote : Re: [Bug 260171] Re: Codehosting server used a lot of memory and not releasing "old" processes

On Wed, Feb 11, 2009 at 9:26 PM, James Troup <email address hidden> wrote:
> ping. Even with the codehosting box upgraded to 8Gb of memory, this
> continues to be an issue...
>

Yeah, sorry. If I could choose what I worked on, this would be it.

Michael's working on codebounce at the moment & I'm flat out on source
packages. Once I finish the branch I'm working on now, I'll fight for
the time to work on this (if no one else has got to it by then).

Tim Penhey (thumper)
Changed in launchpad-bazaar:
milestone: 2.2.2 → 2.2.3
Michael Hudson-Doyle (mwhudson) wrote :

So this morning I got an unexpected clue as to what might be causing this: I had an ssh process connected to bazaar.launchpad.net from yesterday -- and its parent was the bzr-notify process. Now, this may be bug #335180, or just the fact that bzr-notify doesn't run the cycle collector very often, but if this affects more than 25% of Canonical employees even, it could be responsible for a large enough number of apparently stale processes on the server.

I've just sent off to ec2test a change that will disconnect connections that are idle (no traffic in either direction) for more than an hour, which will hopefully kill the problem off once and for all.
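
The idle-disconnect idea can be sketched roughly as follows. This is only an illustration using the standard library, not the change that was sent to ec2test; the class IdleTracker, the function reap_idle, and the injectable clock are all assumptions made for the example.

```python
import time


class IdleTracker:
    """Track the last traffic time on one connection (hypothetical helper)."""

    def __init__(self, timeout=3600.0, clock=time.monotonic):
        self.timeout = timeout  # seconds of silence allowed
        self._clock = clock  # injectable so the logic can be tested
        self._last_activity = clock()

    def touch(self):
        """Record traffic in either direction."""
        self._last_activity = self._clock()

    def is_idle(self):
        """True once no traffic has been seen for more than `timeout` seconds."""
        return self._clock() - self._last_activity > self.timeout


def reap_idle(connections):
    """Close every idle connection and return the ones still open.

    `connections` is a list of (tracker, close_callback) pairs.
    """
    still_open = []
    for tracker, close in connections:
        if tracker.is_idle():
            close()
        else:
            still_open.append((tracker, close))
    return still_open
```

The server would call touch() whenever bytes are read from or written to the client, and run reap_idle periodically. Unlike keepalive, this also disconnects clients that are alive but simply holding the connection open without using it.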

Jonathan Lange (jml)
Changed in launchpad-bazaar:
status: In Progress → Fix Released
Tom Haddon (mthaddon) wrote :

Is this happening again? See https://pastebin.canonical.com/25181/

Jonathan Lange (jml) wrote :

I did a simple experiment locally.

Make sure I had the latest db-stable:
 * Pull the latest db-stable (bzr pull)
 * Update download-cache (bzr up download-cache)
 * Update sourcecode (./utilities/update-sourcecode ~/src/launchpad/devel)

Change the codehosting timeout to 30 seconds:
 * Edit configs/development/launchpad-lazr.conf
 * Add "idle_timeout: 30" to the bottom of the codehosting clause

Make a tool that will do a Bazaar request and then wait:
 * bzr branch lp:bzr-ping ~/.bazaar/plugins/ping
 * Edit ~/.bazaar/plugins/ping/__init__.py, putting "import pdb; pdb.set_trace()" at the bottom of run()

Start the server:
 * make schema
 * make run_all

In a different terminal:
 * Make a user (./utilities/make-lp-user jml)
 * Simulate an idle connection (bzr ping bzr+ssh://bazaar.launchpad.dev)

Watch what happens:
 * Check that the spawned bzr process is still running (ps ax | grep lp-serve)
 * Repeat
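
The check-and-repeat step could also be scripted. A rough sketch, assuming a `ps ax` style listing; the function names, the one-second polling interval and the two-minute limit are illustrative choices, not part of Launchpad.

```python
import subprocess
import time


def count_matching(ps_output, needle="lp-serve"):
    """Count process lines mentioning `needle`, ignoring any grep itself."""
    return sum(
        1
        for line in ps_output.splitlines()
        if needle in line and "grep" not in line
    )


def wait_until_gone(needle="lp-serve", poll=1.0, limit=120.0):
    """Poll `ps ax` until no matching process remains.

    Returns the seconds waited, or None if a matching process is still
    present after `limit` seconds.
    """
    start = time.monotonic()
    while time.monotonic() - start < limit:
        listing = subprocess.run(
            ["ps", "ax"], capture_output=True, text=True, check=True
        ).stdout
        if count_matching(listing, needle) == 0:
            return time.monotonic() - start
        time.sleep(poll)
    return None
```

Calling wait_until_gone() right after starting the idle client should return a value close to the configured idle_timeout if the server is disconnecting idle connections as intended.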

30 seconds later, the process no longer appears in the 'ps' output, and the pdb window looks like this:

$ bzr ping bzr+ssh://bazaar.launchpad.dev
Response: ('ok', '2')
--Return--
> /home/jml/.bazaar/plugins/ping/__init__.py(47)run()->None
-> import pdb; pdb.set_trace()
(Pdb) Connection to launchpad.dev closed by remote host.

Curtis Hovey (sinzui)
visibility: private → public