sockets being leaked in branch puller tests

Bug #193253 reported by Stuart Bishop
2
Affects Status Importance Assigned to Milestone
Bazaar
Invalid
Undecided
Unassigned
Launchpad itself
Fix Released
High
Stuart Bishop

Bug Description

Add the following teardown method to canonical.codehosting.tests.test_scanner_bug_links.RevisionPropertyParsing:

    def tearDown(self):
        super(RevisionPropertyParsing, self).tearDown()
        import gc
        from pprint import pprint
        gc.collect()
        if gc.garbage:
            print 'Garbage:'
            pprint(gc.garbage)
            print '\nReferenced by:'
            pprint(gc.get_referrers(*gc.garbage))

It looks like a bzrlib.transport.http._urllib2_wrappers.Response is being constructed but never explicitly closed and this is leaking due to a reference cycle containing objects with a __del__ method (socket._fileobject I think, which would be closed when Response.close is called).

Tags: lp-code
Revision history for this message
Stuart Bishop (stub) wrote :

bzr+ssh://devpad.canonical.com/code/stub/launchpad/testsuite-leak-protection can be merged once this is fixed to stop this occurring again.

Jonathan Lange (jml)
Changed in launchpad-bazaar:
importance: Undecided → High
milestone: none → 1.2.3
status: New → Confirmed
Revision history for this message
Jonathan Lange (jml) wrote :

This was fixed (accidentally) by the recent scanner refactor.

Changed in launchpad-bazaar:
status: Confirmed → Fix Released
Revision history for this message
Stuart Bishop (stub) wrote :

We still have socket leaks. I still can't land my prophylactic:

(canonical.codehosting.tests.test_puller_acceptance.TestBranchPuller) (7.229 ms)Traceback (most recent call last):
  File "test.py", line 147, in ?
    result = testrunner.run(defaults)
  File "/home/pqm/pqm-workdir/home/---devel/launchpad/lib/zope/testing/testrunner.py", line 238, in run
    failed = not run_with_options(options)
  File "/home/pqm/pqm-workdir/home/---devel/launchpad/lib/zope/testing/testrunner.py", line 403, in run_with_options
    setup_layers, failures, errors)
  File "/home/pqm/pqm-workdir/home/---devel/launchpad/lib/zope/testing/testrunner.py", line 583, in run_layer
    return run_tests(options, tests, layer_name, failures, errors)
  File "/home/pqm/pqm-workdir/home/---devel/launchpad/lib/zope/testing/testrunner.py", line 512, in run_tests
    test(result)
  File "/usr/lib/python2.4/unittest.py", line 281, in __call__
    return self.run(*args, **kwds)
  File "/home/pqm/pqm-workdir/home/---devel/launchpad/lib/bzrlib/tests/__init__.py", line 1284, in run
    return unittest.TestCase.run(self, result)
  File "/usr/lib/python2.4/unittest.py", line 278, in run
    result.stopTest(self)
  File "/home/pqm/pqm-workdir/home/---devel/launchpad/lib/zope/testing/testrunner.py", line 861, in stopTest
    self.testTearDown()
  File "/home/pqm/pqm-workdir/home/---devel/launchpad/lib/zope/testing/testrunner.py", line 745, in testTearDown
    layer.testTearDown()
  File "/home/pqm/pqm-workdir/home/---devel/launchpad/lib/canonical/testing/profiled.py", line 24, in profiled_func
    return func(cls, *args, **kw)
  File "/home/pqm/pqm-workdir/home/---devel/launchpad/lib/canonical/testing/layers.py", line 179, in testTearDown
    raise LayerIsolationError(
canonical.testing.layers.LayerIsolationError: Test left uncollectable garbage
[<socket._fileobject object at 0x2b31ea2ebc90>] (referenced from [(<socket._fileobject object at 0x2b31ea2ebc90>,), [<socket._fileobject object at 0x2b31ea2ebc90>]])

Changed in launchpad-bazaar:
status: Fix Released → Confirmed
Revision history for this message
Stuart Bishop (stub) wrote :

This is caused by bzrlib. In particular bzrlib.tests.http_server.HTTPServer.tearDown contains the comment:

        # We don't need to 'self._http_thread.join()' here since the thread is
        # a daemonic one and will be garbage collected anyway. Joining just
        # slows us down for no added benefit.

By not waiting for threads to terminate, more intelligent test runners will sometimes pick up the garbage (violating test isolation) and rightly flag the test as failed.

Revision history for this message
Stuart Bishop (stub) wrote :

Further investigation shows that it isn't the HTTPServer itself that is the dangling thread, but one of the threads it spawns to handle a request (SocketServer stuff). It looks like we have a test that is initiating a request but not completing it, so the handler thread is left blocked. This leave a dangling thread and a dangling socket.

Changed in bzr:
status: New → Invalid
Revision history for this message
Jonathan Lange (jml) wrote :

The fussy part of me wants to say that this is a different bug: the scanner tests don't leak any more.

All of the connections in the puller acceptance tests are done in subprocesses, I think. Does the test framework catch lingering processes?

Revision history for this message
Vincent Ladeuil (vila) wrote :

I am the author of the comment.

These sockets are not leaking per se. They are garbage collected cleanly when needed (or I missed something). I leave them that way because joining the threads was taking a long time, slowing the test suite significantly.

I can dig the subject again if needed, but I'm pretty sure I checked there was no leaks because that comment was introduced while tracking socket leaks...

Revision history for this message
Stuart Bishop (stub) wrote :

Vincent - I flagged the bzr task as Invalid. I don't see any need at the moment to change bzrlib at the moment as I think you are right and the HTTPServer thread is garbage collected during the teardown.

What I am seeing in our test suite now is a leaking thread and a leaking socket. I tracked down this to a thread spawned by SocketServer.ThreadingMixIn's proces_request() method. This thread never dies - if I make it non-daemonic the test suite never exits. This is as far as I have got - I expect that the handler is blocked waiting on input from a socket that our test suite never sends.

jml - I changed the bug topic rather than open a fresh bug :-) The goal of fixing this bug is to land some test suite updates that stop these sort of leaks happening in the future.

Revision history for this message
Vincent Ladeuil (vila) wrote : Re: [Bug 193253] Re: sockets being leaked in branch puller tests

>>>>> "Stuart" == Stuart Bishop <email address hidden> writes:

    Stuart> Vincent - I flagged the bzr task as Invalid. I don't see any need at the
    Stuart> moment to change bzrlib at the moment as I think you are right and the
    Stuart> HTTPServer thread is garbage collected during the teardown.

Ok.

One thing to keep in mind is that what
SocketServer.ThreadingMixIn calls a request is really an http
connection not a http request.

    Stuart> What I am seeing in our test suite now is a leaking
    Stuart> thread and a leaking socket. I tracked down this to a
    Stuart> thread spawned by SocketServer.ThreadingMixIn's
    Stuart> proces_request() method. This thread never dies - if
    Stuart> I make it non-daemonic the test suite never
    Stuart> exits.

Hmmm. Don't do that. I can't remember the details (which I fully
realized are what *you* are interested in), but there is another
comment :

class TestingThreadingHTTPServer(SocketServer.ThreadingTCPServer,
                                 TestingHTTPServerMixin):
<snip/>

    def __init__(self, server_address, request_handler_class,
                 test_case_server):
        TestingHTTPServerMixin.__init__(self, test_case_server)
        SocketServer.ThreadingTCPServer.__init__(self, server_address,
                                                 request_handler_class)
        # Decides how threads will act upon termination of the main
        # process. This is prophylactic as we should not leave the threads
        # lying around.
        self.daemon_threads = True

I don't remember the exact details, but roughly, since the http
client don't explicitly close the connection (bzr design), when
you use a persistent connection (as in http/1.1), you should
leave the connection thread dying in its own way and not force
the main thread to wait for the connection threads to die or
you'll block because you got the test waiting for the server to
die before it cleans itself making the client die (or something
like that, that's the best I can remember at the moment).

    Stuart> This is as far as I have got - I expect that the
    Stuart> handler is blocked waiting on input from a socket
    Stuart> that our test suite never sends.

Yeah, he's waiting for the client to *close* the connection,
which in the bzr case occurs during garbage collection *by
design*.

Jonathan Lange (jml)
Changed in launchpad-bazaar:
assignee: nobody → jml
Revision history for this message
Jonathan Lange (jml) wrote :

Hi Stuart,

I've just had a look at this. The client side of the connection is established by Branch.open in assertMirrored. Bazaar doesn't provide an API for closing this connection.

Given that Bazaar's behaviour is correct and that the test is using it sanely, I can think of three options:
- Fiddle with the Bazaar internals
- Run the test HTTP server in a subprocess and kill it when we are done
- Change the socket-leaking check

I'm leaning towards the third option. Like I said, Bazaar is working fine and we are using their API correctly, so it feels a bit backwards to have to "fix" it.

Thoughts?

Revision history for this message
Vincent Ladeuil (vila) wrote :

>>>>> "Jonathan" == Jonathan Lange <email address hidden> writes:

    Jonathan> Hi Stuart,

    Jonathan> I've just had a look at this. The client side of
    Jonathan> the connection is established by Branch.open in
    Jonathan> assertMirrored. Bazaar doesn't provide an API for
    Jonathan> closing this connection.

    Jonathan> Given that Bazaar's behaviour is correct and that
    Jonathan> the test is using it sanely, I can think of three
    Jonathan> options:

    Jonathan> - Fiddle with the Bazaar internals

    Jonathan> - Run the test HTTP server in a subprocess and kill
    Jonathan> it when we are done

    Jonathan> - Change the socket-leaking check

    Jonathan> I'm leaning towards the third option. Like I said,
    Jonathan> Bazaar is working fine and we are using their API
    Jonathan> correctly, so it feels a bit backwards to have to
    Jonathan> "fix" it.

    Jonathan> Thoughts?

Memory kind of coming back, when I rewrote the HTTP server code,
at one point, I implemented a way to force the connection closing
on the *server* side.

I didn't continue in that direction because:

1) Tracking HTTP connections was a bit ugly,

2) Waiting for the server thread slowed down the test suite
   significantly.

One reason for the server rewrite was indeed to fix all the
leaking sockets and AFAIK there are no more leaks anymore (at
least all the open sockets get gc'ed cleanly at some point) and
allowed the bzr test suite to run on OS X again (which was one of
the initial bugs).

Revision history for this message
Jonathan Lange (jml) wrote :

I gave stub a patch for this that fixed a leak. From later IRC comments, it seems that there are other tests with the same leak.

The workaround is not use the path on disk to check that revisions match, rather than using the http url.

Changed in launchpad-bazaar:
assignee: jml → stub
status: Confirmed → In Progress
Tim Penhey (thumper)
Changed in launchpad-bazaar:
milestone: 1.2.3 → 1.2.4
Revision history for this message
Jonathan Lange (jml) wrote :

r6010.

Changed in launchpad-bazaar:
status: In Progress → Fix Committed
Tim Penhey (thumper)
Changed in launchpad-bazaar:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.