Codehosting server used a lot of memory and not releasing "old" processes

Bug #260171 reported by Tom Haddon on 2008-08-21

Affects	Status	Importance	Assigned to	Milestone
Launchpad itself	Fix Released	High	Michael Hudson-Doyle	Launchpad itself 2.2.3

Bug Description

On 2008-08-20 17:00 UTC we had to restart the codehosting server because it was using excessive memory/swap and vostok was down to 13% swap free. There were 31 lp-serve processes using 32MB each, some from August 1st.

The logs should be available on devpad in the usual place:

/srv/launchpad.net-logs/scripts/vostok/

Thanks, Tom

Michael Hudson-Doyle (mwhudson) wrote on 2008-08-26:

31*32 is only a gig, which is a bit excessive but where was all the other memory going?

The processes hanging around issue is semi-known, but I don't think understood. At least, *I* don't understand it.

Tom Haddon (mthaddon) wrote on 2008-08-26:

Rsync backups were taking up some memory, but I think 31*32MB, including that many stale processes, is enough to worry about on its own.

Jonathan Lange (jml) wrote on 2008-08-27:

I didn't know about the old processes issue. What do we know about it? Do we know what's causing it or whether it's reproducible locally?

Changed in launchpad-bazaar:
importance:	Undecided → High
status:	New → Triaged

Michael Hudson-Doyle (mwhudson) wrote on 2008-08-28:

Oh. It seems that 'bzr lp-serve' processes gradually accumulate on the server, long after it seems likely that a client is interested.

Poking around now, it seems that a client is still connected, though, at least according to netstat. I guess if any traffic was sent down the socket we'd get a connection reset by peer error or something.

Maybe we should have an inactivity timeout in the ssh server?

Jonathan Lange (jml) wrote on 2008-08-28: Re: [Bug 260171] Re: Codehosting server used a lot of memory and not releasing "old" processes

On Fri, Aug 29, 2008 at 7:51 AM, Michael Hudson
<email address hidden> wrote:
> Oh. It seems that 'bzr lp-serve' process gradually accumulate on the
> server, long after it seems likely that a client is interested.
>
> Poking around now, it seems that a client is still connected, though, at
> least according to netstat. I guess if any traffic was sent down the
> socket we'd get a connection reset by peer error or something.
>
> Maybe we should have an inactivity timeout in the ssh server?
>

A timeout is a great idea. I still want to know the root cause, though.

jml

Tom Haddon (mthaddon) wrote on 2008-09-12:

Happening again (28 stale processes this time)

Jonathan Lange (jml) wrote on 2008-09-12:

It'd be good to get pystack reports from gdb the next time this happens. At least we'll get a clue as to what the code thinks it's doing.

Tom Haddon (mthaddon) wrote on 2008-09-13:

Can we try and replicate this on staging? I'm not sure diving right into using gdb on the production server is the best way to go here.

Jonathan Lange (jml) wrote on 2008-09-14:

On Sun, Sep 14, 2008 at 6:37 AM, Tom Haddon <email address hidden> wrote:
> Can we try and replicate this on staging. I'm not sure diving right into
> using gdb on the production server is the best way to go here.
>

Sure we can try. It will simply take much longer to replicate the
problem, since we're still unaware of what triggers it.

jml

Michael Hudson-Doyle (mwhudson) wrote on 2008-09-17:

#10

So all these lp-serve processes seem to correspond to an ESTABLISHED connection in netstat. It seems very likely that the connections are in fact dead, but as we're not trying to send any data to the client, no one notices.

Can you set SO_KEEPALIVE on the server end of a socket?
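
As a rough illustration of the question above: SO_KEEPALIVE is a per-socket option, so it can be enabled on the socket a server gets back from accept(). The sketch below uses Python's standard socket module. The helper name enable_keepalive and the timing values are made up for illustration, and the TCP_KEEP* tuning options are platform-specific, so they are only applied where the module exposes them. It is not necessarily how the codehosting ssh server handles its sockets.

```python
import socket


def enable_keepalive(sock, idle=600, interval=60, count=5):
    """Turn on TCP keepalive probes for a socket (hypothetical helper).

    idle, interval and count are illustrative values, not recommendations.
    """
    # Ask the kernel to probe the peer when the connection goes quiet.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket tuning is not available on every platform, so only set
    # the options that this Python build actually exposes.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of silence before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between unanswered probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes allowed before the connection is declared dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock
```

A server would call enable_keepalive(conn) right after conn, addr = listener.accept(). Once the probes go unanswered, the kernel should treat the connection as dead, so a read blocked on it fails with an error rather than waiting forever.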

Michael Hudson-Doyle (mwhudson) wrote on 2008-09-19:

#11

I think I have a fix in progress for this. No idea how to test its effectiveness without cowboying it on to production for a few days though ...

Changed in launchpad-bazaar:
assignee:	nobody → mwhudson
milestone:	none → 2.1.10
status:	Triaged → In Progress

Michael Hudson-Doyle (mwhudson) wrote on 2008-10-16:

#12

So there's a chance that the server move fixed this by accident.

If not, I'd like to test my patch at some point...

Changed in launchpad-bazaar:
milestone:	2.1.10 → 2.1.11

Diogo Matsubara (matsubara) wrote on 2008-11-24:

#13

Moving to .12

Changed in launchpad-bazaar:
milestone:	2.1.11 → 2.1.12

Paul Hummer (rockstar) on 2009-01-09

Changed in launchpad-bazaar:
milestone:	2.1.12 → 2.2.1

Michael Hudson-Doyle (mwhudson) on 2009-02-04

Changed in launchpad-bazaar:
milestone:	2.2.1 → 2.2.2

James Troup (elmo) wrote on 2009-02-11:

#14

ping. Even with the codehosting box upgraded to 8GB of memory, this continues to be an issue...

Jonathan Lange (jml) wrote on 2009-02-11: Re: [Bug 260171] Re: Codehosting server used a lot of memory and not releasing "old" processes

#15

On Wed, Feb 11, 2009 at 9:26 PM, James Troup <email address hidden> wrote:
> ping. Even with the codehosting box upgraded to 8Gb of memory, this
> continues to be an issue...
>

Yeah, sorry. If I could choose what I worked on, this would be it.

Michael's working on codebounce at the moment & I'm flat out on source
packages. Once I finish the branch I'm working on now, I'll fight for
the time to work on this (if no one else has got to it by then).

Tim Penhey (thumper) on 2009-02-28

Changed in launchpad-bazaar:
milestone:	2.2.2 → 2.2.3

Michael Hudson-Doyle (mwhudson) wrote on 2009-03-12:

#16

So this morning I got an unexpected clue as to what might be causing this: I had an ssh process connected to bazaar.launchpad.net from yesterday -- and its parent was the bzr-notify process. Now, this may be bug #335180, or just the fact that bzr-notify doesn't run the cycle collector very often, but if this affects more than 25% of Canonical employees even, it could be responsible for a large enough number of apparently stale processes on the server.

I've just sent off to ec2test a change that will disconnect connections that are idle (no traffic in either direction) for more than an hour, which will hopefully kill the problem off once and for all.
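
To make the idea concrete, here is a minimal sketch of an idle timeout, assuming the server can call a hook whenever bytes move in either direction and can poll a check periodically. The class name IdleWatchdog, its methods, and the injected clock are all hypothetical. This is not the actual change that went to ec2test, only an illustration of "disconnect after an hour with no traffic".

```python
import time


class IdleWatchdog:
    """Fire a callback once a connection has been idle for too long.

    Hypothetical illustration; names and structure are assumptions.
    """

    def __init__(self, timeout, on_idle, clock=time.monotonic):
        self.timeout = timeout      # seconds of silence allowed
        self.on_idle = on_idle      # called once, e.g. to drop the connection
        self.clock = clock          # injectable so the logic can be tested
        self.last_activity = clock()
        self.fired = False

    def touch(self):
        """Record traffic in either direction, restarting the idle period."""
        self.last_activity = self.clock()

    def check(self):
        """Call periodically; returns True once the timeout has fired."""
        idle_for = self.clock() - self.last_activity
        if not self.fired and idle_for > self.timeout:
            self.fired = True
            self.on_idle()
        return self.fired
```

With timeout=3600, a connection that sees any traffic keeps resetting the timer through touch(), while one that stays silent for more than an hour gets on_idle() called exactly once.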

Jonathan Lange (jml) on 2009-03-17

Changed in launchpad-bazaar:
status:	In Progress → Fix Released

Tom Haddon (mthaddon) wrote on 2009-12-01:

#17

Is this happening again? See https://pastebin.canonical.com/25181/

Jonathan Lange (jml) wrote on 2009-12-01:

#18

I did a simple experiment locally.

Make sure I had the latest db-stable:
* Pull the latest db-stable (bzr pull)
* Update download-cache (bzr up download-cache)
* Update sourcecode (./utilities/update-sourcecode ~/src/launchpad/devel)

Change the codehosting timeout to 30 seconds:
* Edit configs/development/launchpad-lazr.conf
* Add "idle_timeout: 30" to the bottom of the codehosting clause

Make a tool that will do a Bazaar request and then wait:
* bzr branch lp:bzr-ping ~/.bazaar/plugins/ping
* Edit ~/.bazaar/plugins/ping/__init__.py, putting "import pdb; pdb.set_trace()" at the bottom of run()

Start the server:
* make schema
* make run_all

In a different terminal:
* Make a user (./utilities/make-lp-user jml)
* Simulate an idle connection (bzr ping bzr+ssh://bazaar.launchpad.dev)

Watch what happens:
* Check that the spawned bzr process is still running (ps ax | grep lp-serve)
* Repeat

30 seconds later, the process no longer appears in the 'ps' output, and the pdb window looks like this:

$ bzr ping bzr+ssh://bazaar.launchpad.dev
Response: ('ok', '2')
--Return--
> /home/jml/.bazaar/plugins/ping/__init__.py(47)run()->None
-> import pdb; pdb.set_trace()
(Pdb) Connection to launchpad.dev closed by remote host.
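
The "watch what happens" step above can also be scripted rather than repeated by hand. The following is a sketch only: the function names, the polling interval, and the limit are assumptions, and it simply looks for the string "lp-serve" in the output of `ps ax`, as the manual check does.

```python
import subprocess
import time


def count_matching(ps_output, needle="lp-serve"):
    """Count lines of process-listing output that mention needle."""
    return sum(1 for line in ps_output.splitlines() if needle in line)


def run_ps():
    """Return the output of `ps ax` as text."""
    result = subprocess.run(
        ["ps", "ax"], capture_output=True, text=True, check=True
    )
    return result.stdout


def wait_until_gone(needle="lp-serve", limit=60, interval=5,
                    list_processes=run_ps, sleep=time.sleep):
    """Poll until no listed process mentions needle.

    Returns the number of seconds waited (in steps of interval) when the
    process disappears, or None if it is still present after limit seconds.
    """
    waited = 0
    while waited <= limit:
        if count_matching(list_processes(), needle) == 0:
            return waited
        sleep(interval)
        waited += interval
    return None
```

Run alongside the idle `bzr ping` session with the timeout set to 30 seconds, wait_until_gone() would be expected to return a value a little over 30, give or take the polling interval.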

Curtis Hovey (sinzui) on 2012-05-10

visibility:	private → public
