Bug #193253 “sockets being leaked in branch puller tests” : Bugs : Bazaar

Revision history for this message

Stuart Bishop (stub) wrote on 2008-02-21:

#1

bzr+ssh://devpad.canonical.com/code/stub/launchpad/testsuite-leak-protection can be merged once this is fixed to stop this occurring again.

Jonathan Lange (jml) on 2008-02-24

Changed in launchpad-bazaar:
importance:	Undecided → High
milestone:	none → 1.2.3
status:	New → Confirmed

Revision history for this message

Jonathan Lange (jml) wrote on 2008-02-25:

#2

This was fixed (accidentally) by the recent scanner refactor.

Changed in launchpad-bazaar:
status:	Confirmed → Fix Released

Revision history for this message

Stuart Bishop (stub) wrote on 2008-02-27:

#3

We still have socket leaks. I still can't land my prophylactic:

(canonical.codehosting.tests.test_puller_acceptance.TestBranchPuller) (7.229 ms)Traceback (most recent call last):
  File "test.py", line 147, in ?
    result = testrunner.run(defaults)
  File "/home/pqm/pqm-workdir/home/---devel/launchpad/lib/zope/testing/testrunner.py", line 238, in run
    failed = not run_with_options(options)
  File "/home/pqm/pqm-workdir/home/---devel/launchpad/lib/zope/testing/testrunner.py", line 403, in run_with_options
    setup_layers, failures, errors)
  File "/home/pqm/pqm-workdir/home/---devel/launchpad/lib/zope/testing/testrunner.py", line 583, in run_layer
    return run_tests(options, tests, layer_name, failures, errors)
  File "/home/pqm/pqm-workdir/home/---devel/launchpad/lib/zope/testing/testrunner.py", line 512, in run_tests
    test(result)
  File "/usr/lib/python2.4/unittest.py", line 281, in __call__
    return self.run(*args, **kwds)
  File "/home/pqm/pqm-workdir/home/---devel/launchpad/lib/bzrlib/tests/__init__.py", line 1284, in run
    return unittest.TestCase.run(self, result)
  File "/usr/lib/python2.4/unittest.py", line 278, in run
    result.stopTest(self)
  File "/home/pqm/pqm-workdir/home/---devel/launchpad/lib/zope/testing/testrunner.py", line 861, in stopTest
    self.testTearDown()
  File "/home/pqm/pqm-workdir/home/---devel/launchpad/lib/zope/testing/testrunner.py", line 745, in testTearDown
    layer.testTearDown()
  File "/home/pqm/pqm-workdir/home/---devel/launchpad/lib/canonical/testing/profiled.py", line 24, in profiled_func
    return func(cls, *args, **kw)
  File "/home/pqm/pqm-workdir/home/---devel/launchpad/lib/canonical/testing/layers.py", line 179, in testTearDown
    raise LayerIsolationError(
canonical.testing.layers.LayerIsolationError: Test left uncollectable garbage
[<socket._fileobject object at 0x2b31ea2ebc90>] (referenced from [(<socket._fileobject object at 0x2b31ea2ebc90>,), [<socket._fileobject object at 0x2b31ea2ebc90>]])

We still have socket leaks. I still can't land my prophylactic:

(canonical.codehosting.tests.test_puller_acceptance.TestBranchPuller) (7.229 ms)Traceback (most recent call last):
  File "test.py", line 147, in ?
    result = testrunner.run(defaults)
  File "/home/pqm/pqm-workdir/home/---devel/launchpad/lib/zope/testing/testrunner.py", line 238, in run
    failed = not run_with_options(options)
  File "/home/pqm/pqm-workdir/home/---devel/launchpad/lib/zope/testing/testrunner.py", line 403, in run_with_options
    setup_layers, failures, errors)
  File "/home/pqm/pqm-workdir/home/---devel/launchpad/lib/zope/testing/testrunner.py", line 583, in run_layer
    return run_tests(options, tests, layer_name, failures, errors)
  File "/home/pqm/pqm-workdir/home/---devel/launchpad/lib/zope/testing/testrunner.py", line 512, in run_tests
    test(result)
  File "/usr/lib/python2.4/unittest.py", line 281, in __call__
    return self.run(*args, **kwds)
  File "/home/pqm/pqm-workdir/home/---devel/launchpad/lib/bzrlib/tests/__init__.py", line 1284, in run
    return unittest.TestCase.run(self, result)
  File "/usr/lib/python2.4/unittest.py", line 278, in run
    result.stopTest(self)
  File "/home/pqm/pqm-workdir/home/---devel/launchpad/lib/zope/testing/testrunner.py", line 861, in stopTest
    self.testTearDown()
  File "/home/pqm/pqm-workdir/home/---devel/launchpad/lib/zope/testing/testrunner.py", line 745, in testTearDown
    layer.testTearDown()
  File "/home/pqm/pqm-workdir/home/---devel/launchpad/lib/canonical/testing/profiled.py", line 24, in profiled_func
    return func(cls, *args, **kw)
  File "/home/pqm/pqm-workdir/home/---devel/launchpad/lib/canonical/testing/layers.py", line 179, in testTearDown
    raise LayerIsolationError(
canonical.testing.layers.LayerIsolationError: Test left uncollectable garbage
[<socket._fileobject object at 0x2b31ea2ebc90>] (referenced from [(<socket._fileobject object at 0x2b31ea2ebc90>,), [<socket._fileobject object at 0x2b31ea2ebc90>]])

Changed in launchpad-bazaar:
status:	Fix Released → Confirmed

Revision history for this message

Stuart Bishop (stub) wrote on 2008-03-05:

#4

This is caused by bzrlib. In particular bzrlib.tests.http_server.HTTPServer.tearDown contains the comment:

        # We don't need to 'self._http_thread.join()' here since the thread is
        # a daemonic one and will be garbage collected anyway. Joining just
        # slows us down for no added benefit.

By not waiting for threads to terminate, more intelligent test runners will sometimes pick up the garbage (violating test isolation) and rightly flag the test as failed.

Revision history for this message

Stuart Bishop (stub) wrote on 2008-03-05:

#5

Further investigation shows that it isn't the HTTPServer itself that is the dangling thread, but one of the threads it spawns to handle a request (SocketServer stuff). It looks like we have a test that is initiating a request but not completing it, so the handler thread is left blocked. This leave a dangling thread and a dangling socket.

Changed in bzr:
status:	New → Invalid

Revision history for this message

Jonathan Lange (jml) wrote on 2008-03-05:

#6

The fussy part of me wants to say that this is a different bug: the scanner tests don't leak any more.

All of the connections in the puller acceptance tests are done in subprocesses, I think. Does the test framework catch lingering processes?

Revision history for this message

Vincent Ladeuil (vila) wrote on 2008-03-05:

#7

I am the author of the comment.

These sockets are not leaking per se. They are garbage collected cleanly when needed (or I missed something). I leave them that way because joining the threads was taking a long time, slowing the test suite significantly.

I can dig the subject again if needed, but I'm pretty sure I checked there was no leaks because that comment was introduced while tracking socket leaks...

Revision history for this message

Stuart Bishop (stub) wrote on 2008-03-05:

#8

Vincent - I flagged the bzr task as Invalid. I don't see any need at the moment to change bzrlib at the moment as I think you are right and the HTTPServer thread is garbage collected during the teardown.

What I am seeing in our test suite now is a leaking thread and a leaking socket. I tracked down this to a thread spawned by SocketServer.ThreadingMixIn's proces_request() method. This thread never dies - if I make it non-daemonic the test suite never exits. This is as far as I have got - I expect that the handler is blocked waiting on input from a socket that our test suite never sends.

jml - I changed the bug topic rather than open a fresh bug :-) The goal of fixing this bug is to land some test suite updates that stop these sort of leaks happening in the future.

Revision history for this message

Vincent Ladeuil (vila) wrote on 2008-03-05: Re: [Bug 193253] Re: sockets being leaked in branch puller tests

#9

>>>>> "Stuart" == Stuart Bishop <email address hidden> writes:

    Stuart> Vincent - I flagged the bzr task as Invalid. I don't see any need at the
    Stuart> moment to change bzrlib at the moment as I think you are right and the
    Stuart> HTTPServer thread is garbage collected during the teardown.

Ok.

One thing to keep in mind is that what
SocketServer.ThreadingMixIn calls a request is really an http
connection not a http request.

    Stuart> What I am seeing in our test suite now is a leaking
    Stuart> thread and a leaking socket. I tracked down this to a
    Stuart> thread spawned by SocketServer.ThreadingMixIn's
    Stuart> proces_request() method. This thread never dies - if
    Stuart> I make it non-daemonic the test suite never
    Stuart> exits.

Hmmm. Don't do that. I can't remember the details (which I fully
realized are what *you* are interested in), but there is another
comment :

class TestingThreadingHTTPServer(SocketServer.ThreadingTCPServer,
TestingHTTPServerMixin):
<snip/>

    def __init__(self, server_address, request_handler_class,
                 test_case_server):
        TestingHTTPServerMixin.__init__(self, test_case_server)
        SocketServer.ThreadingTCPServer.__init__(self, server_address,
                                                 request_handler_class)
        # Decides how threads will act upon termination of the main
        # process. This is prophylactic as we should not leave the threads
        # lying around.
        self.daemon_threads = True

I don't remember the exact details, but roughly, since the http
client don't explicitly close the connection (bzr design), when
you use a persistent connection (as in http/1.1), you should
leave the connection thread dying in its own way and not force
the main thread to wait for the connection threads to die or
you'll block because you got the test waiting for the server to
die before it cleans itself making the client die (or something
like that, that's the best I can remember at the moment).

    Stuart> This is as far as I have got - I expect that the
    Stuart> handler is blocked waiting on input from a socket
    Stuart> that our test suite never sends.

Yeah, he's waiting for the client to *close* the connection,
which in the bzr case occurs during garbage collection *by
design*.

>>>>> "Stuart" == Stuart Bishop <stuart@stuartbishop.net> writes:

Stuart> Vincent - I flagged the bzr task as Invalid. I don't see any need at the
    Stuart> moment to change bzrlib at the moment as I think you are right and the
    Stuart> HTTPServer thread is garbage collected during the teardown.

Ok.

One thing to keep in mind is that what
SocketServer.ThreadingMixIn calls a request is really an http
connection not a http request.

Stuart> What I am seeing in our test suite now is a leaking
    Stuart> thread and a leaking socket. I tracked down this to a
    Stuart> thread spawned by SocketServer.ThreadingMixIn's
    Stuart> proces_request() method. This thread never dies - if
    Stuart> I make it non-daemonic the test suite never
    Stuart> exits.

Hmmm. Don't do that. I can't remember the details (which I fully
realized are what *you* are interested in), but there is another
comment :

class TestingThreadingHTTPServer(SocketServer.ThreadingTCPServer,
                                 TestingHTTPServerMixin):
<snip/>

def __init__(self, server_address, request_handler_class,
                 test_case_server):
        TestingHTTPServerMixin.__init__(self, test_case_server)
        SocketServer.ThreadingTCPServer.__init__(self, server_address,
                                                 request_handler_class)
        # Decides how threads will act upon termination of the main
        # process. This is prophylactic as we should not leave the threads
        # lying around.
        self.daemon_threads = True

I don't remember the exact details, but roughly, since the http
client don't explicitly close the connection (bzr design), when
you use a persistent connection (as in http/1.1), you should
leave the connection thread dying in its own way and not force
the main thread to wait for the connection threads to die or
you'll block because you got the test waiting for the server to
die before it cleans itself making the client die (or something
like that, that's the best I can remember at the moment).

Stuart> This is as far as I have got - I expect that the
    Stuart> handler is blocked waiting on input from a socket
    Stuart> that our test suite never sends.

Yeah, he's waiting for the client to *close* the connection,
which in the bzr case occurs during garbage collection *by
design*.

Jonathan Lange (jml) on 2008-03-07

Changed in launchpad-bazaar:
assignee:	nobody → jml

Revision history for this message

Jonathan Lange (jml) wrote on 2008-03-12:

#10

Hi Stuart,

I've just had a look at this. The client side of the connection is established by Branch.open in assertMirrored. Bazaar doesn't provide an API for closing this connection.

Given that Bazaar's behaviour is correct and that the test is using it sanely, I can think of three options:
- Fiddle with the Bazaar internals
- Run the test HTTP server in a subprocess and kill it when we are done
- Change the socket-leaking check

I'm leaning towards the third option. Like I said, Bazaar is working fine and we are using their API correctly, so it feels a bit backwards to have to "fix" it.

Thoughts?

Revision history for this message

Vincent Ladeuil (vila) wrote on 2008-03-12:

#11

>>>>> "Jonathan" == Jonathan Lange <email address hidden> writes:

Jonathan> Hi Stuart,

    Jonathan> I've just had a look at this. The client side of
    Jonathan> the connection is established by Branch.open in
    Jonathan> assertMirrored. Bazaar doesn't provide an API for
    Jonathan> closing this connection.

    Jonathan> Given that Bazaar's behaviour is correct and that
    Jonathan> the test is using it sanely, I can think of three
    Jonathan> options:

Jonathan> - Fiddle with the Bazaar internals

Jonathan> - Run the test HTTP server in a subprocess and kill
Jonathan> it when we are done

Jonathan> - Change the socket-leaking check

    Jonathan> I'm leaning towards the third option. Like I said,
    Jonathan> Bazaar is working fine and we are using their API
    Jonathan> correctly, so it feels a bit backwards to have to
    Jonathan> "fix" it.

Jonathan> Thoughts?

Memory kind of coming back, when I rewrote the HTTP server code,
at one point, I implemented a way to force the connection closing
on the *server* side.

I didn't continue in that direction because:

1) Tracking HTTP connections was a bit ugly,

2) Waiting for the server thread slowed down the test suite
significantly.

One reason for the server rewrite was indeed to fix all the
leaking sockets and AFAIK there are no more leaks anymore (at
least all the open sockets get gc'ed cleanly at some point) and
allowed the bzr test suite to run on OS X again (which was one of
the initial bugs).

Revision history for this message

Jonathan Lange (jml) wrote on 2008-03-12:

#12

I gave stub a patch for this that fixed a leak. From later IRC comments, it seems that there are other tests with the same leak.

The workaround is not use the path on disk to check that revisions match, rather than using the http url.

Changed in launchpad-bazaar:
assignee:	jml → stub
status:	Confirmed → In Progress

Tim Penhey (thumper) on 2008-04-06

Changed in launchpad-bazaar:
milestone:	1.2.3 → 1.2.4

Revision history for this message

Jonathan Lange (jml) wrote on 2008-04-06:

#13

r6010.

Changed in launchpad-bazaar:
status:	In Progress → Fix Committed

Tim Penhey (thumper) on 2008-05-01

Changed in launchpad-bazaar:
status:	Fix Committed → Fix Released

Affects		Status	Importance	Assigned to	Milestone
	Bazaar	Invalid	Undecided	Unassigned
	Launchpad itself	Fix Released	High	Stuart Bishop

Bazaar

sockets being leaked in branch puller tests

Bug Description

Other bug subscribers

Remote bug watches