bzr serve leaves connections in CLOSE_WAIT a *long* time

Bug #164288 reported by Andrew Cowie
Affects: Bazaar
Status: Fix Released
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

We have a bzr:// server running to serve our developers round the world, and Bazaar does not appear to be closing its connections properly. That's Bad (tm).

I only noticed because (obviously) we otherwise get little traffic to the server in question, so the CLOSE_WAIT connections on port 4155 really stand out.

AfC

Martin Pool (mbp) wrote :

http://www.sunmanagers.org/pipermail/summaries/2006-January/007068.html says:

"""CLOSE_WAIT means that the local end of the connection has received
a FIN from the other end, but the OS is waiting for the program at the
local end to actually close its connection.

The problem is your program running on the local machine is not closing
the socket. It is not a TCP tuning issue. A connection can (and quite
correctly) stay in CLOSE_WAIT forever while the program holds the
connection open.

Once the local program closes the socket, the OS can send the FIN to
the remote end which transitions you to LAST_ACK while you wait for
the ACK of the FIN. Once that is received, the connection is finished
and drops from the connection table (if your end is in CLOSE_WAIT
you do _not_ end up in the TIME_WAIT state)."""
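The quoted explanation has a practical reading for server code: once the peer's FIN arrives, `recv()` returns an empty bytes object, and the socket stays in CLOSE_WAIT until the program calls `close()`. Below is a minimal Python sketch of a handler that always closes its end. The names `serve_one_connection` and `demo` and the echo behaviour are hypothetical illustrations, not bzrlib's actual request loop.

```python
import socket


def serve_one_connection(conn: socket.socket) -> int:
    """Echo data back until the peer closes, then close our end.

    Returns the number of bytes echoed. Closing in the ``finally`` block is
    what lets the kernel move this socket out of CLOSE_WAIT (to LAST_ACK,
    then CLOSED once the peer acknowledges our FIN).
    """
    echoed = 0
    try:
        while True:
            data = conn.recv(4096)
            if not data:
                # b'' means the peer sent FIN: our end is now in CLOSE_WAIT.
                break
            conn.sendall(data)
            echoed += len(data)
    finally:
        conn.close()
    return echoed


def demo() -> tuple:
    """Run one client/server exchange over loopback in a single thread."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    conn, _ = listener.accept()
    client.sendall(b"hello")
    client.shutdown(socket.SHUT_WR)  # client sends FIN after its data
    echoed = serve_one_connection(conn)
    received = b""
    while True:
        chunk = client.recv(4096)
        if not chunk:  # server closed its end, so the client sees EOF too
            break
        received += chunk
    client.close()
    listener.close()
    return echoed, received


if __name__ == "__main__":
    print(demo())
```

The `finally` clause is the key design choice: however the loop exits, the socket is closed, so a disconnecting client cannot leave a CLOSE_WAIT entry behind.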

Changed in bzr:
importance: Undecided → Medium
status: New → Confirmed
Andrew Cowie (afcowie) wrote :

It is more severe than I thought:

If you try to shut down and then restart `bzr serve` (i.e., as a system daemon), the fact that there are still open file descriptors (OK, open sockets) on port 4155 means that **bzr serve will not start**. As a result we had an outage of many minutes today while waiting for the TCP stack to clear.

This is still occurring with bzr 1.1.

AfC
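The restart failure described above can be reproduced on one machine. In this hedged Python sketch (the function name is hypothetical; the behaviour described is that of Linux and BSD-style TCP stacks, and Windows differs), an accepted socket is left open after its client disconnects, so it remains in CLOSE_WAIT and keeps the port bound. A fresh listener without SO_REUSEADDR then fails to bind, even though the old listening socket is gone.

```python
import errno
import socket


def restart_without_reuseaddr() -> int:
    """Return the errno from re-binding a port held by a CLOSE_WAIT socket.

    Returns 0 if the second bind unexpectedly succeeds.
    """
    old_listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    old_listener.bind(("127.0.0.1", 0))  # let the OS pick a free port
    old_listener.listen(1)
    port = old_listener.getsockname()[1]

    client = socket.create_connection(("127.0.0.1", port))
    conn, _ = old_listener.accept()  # conn is bound to 127.0.0.1:port
    client.close()  # peer sends FIN; conn stays open, so it sits in CLOSE_WAIT

    old_listener.close()  # the "old server" stops listening...
    new_listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        new_listener.bind(("127.0.0.1", port))  # ...but conn still holds the port
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        new_listener.close()
        conn.close()


if __name__ == "__main__":
    print(errno.errorcode.get(restart_without_reuseaddr(), "bound OK"))
```

On Linux this is expected to report `EADDRINUSE`, which matches the symptom of `bzr serve` refusing to start.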

Robert Collins (lifeless) wrote :

SO_REUSEADDR ftw, kthxbye. It should be a trivial fix, and the current behaviour really does make operating such a server a pillock.

Changed in bzr:
importance: Medium → High
John A Meinel (jameinel) wrote :

Specifically, the simple fix for this is something like:

=== modified file 'bzrlib/smart/server.py'
--- bzrlib/smart/server.py	2007-12-13 22:22:58 +0000
+++ bzrlib/smart/server.py	2008-04-18 21:18:54 +0000
@@ -59,6 +59,10 @@
         self._socket_error = socket_error
         self._socket_timeout = socket_timeout
         self._server_socket = socket.socket()
+        reuse_addr = getattr(socket, 'SO_REUSEADDR', None)
+        if reuse_addr is not None:
+            self._server_socket.setsockopt(socket.SOL_SOCKET,
+                socket.SO_REUSEADDR, 1)
         self._server_socket.bind((host, port))
         self._sockname = self._server_socket.getsockname()
         self.port = self._sockname[1]

Martin Pool (mbp) wrote :

See http://bugs.python.org/issue2550 -- SO_REUSEADDR has a different meaning on Windows. It should be defined on every platform, though.

Note that sockets will still sit in CLOSE_WAIT (this is the specified behaviour and quite normal), but that won't prevent a new server from starting.

Not sure what Robert means by http://www.urbandictionary.com/define.php?term=pillock
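The claim that lingering CLOSE_WAIT sockets no longer block a restart can be checked locally. The Python sketch below (the helper names `make_listener` and `restart_with_reuseaddr` are hypothetical) keeps an accepted connection open after its client has gone, closes the old listener, and binds a new one to the same port. Both listeners set SO_REUSEADDR before `bind()`, as a restarted server carrying the patch would. On Linux the lingering socket generally needs the option too, and it inherits it from its listener. Windows behaviour is different and not covered here.

```python
import socket


def make_listener(port: int) -> socket.socket:
    """Open a loopback TCP listener with SO_REUSEADDR set before bind()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # The option only affects bind() if it is set first, as in the patch.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("127.0.0.1", port))
        sock.listen(1)
    except OSError:
        sock.close()
        raise
    return sock


def restart_with_reuseaddr() -> bool:
    """Return True if a new listener binds while a CLOSE_WAIT socket lingers."""
    old_server = make_listener(0)  # port 0: let the OS choose a free port
    port = old_server.getsockname()[1]
    client = socket.create_connection(("127.0.0.1", port))
    conn, _ = old_server.accept()  # on Linux, inherits SO_REUSEADDR
    client.close()  # client sends FIN; conn stays open, so it moves to CLOSE_WAIT
    old_server.close()  # simulate shutting down the old server
    try:
        new_server = make_listener(port)
    except OSError:
        return False
    finally:
        conn.close()
    new_server.close()
    return True


if __name__ == "__main__":
    print("restart OK" if restart_with_reuseaddr() else "restart failed")
```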

Martin Pool (mbp)
Changed in bzr:
status: Confirmed → Fix Committed
Andrew Bennetts (spiv) wrote :

Note that SO_REUSEADDR has confusingly similar-but-different semantics on Windows — it allows "stealing" a port from a running process (in fact I think the MSDN docs say it's "non-deterministic" which process will receive a new TCP connection in that situation).

Using it is probably the lesser of two evils, though.
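One way to limit the Windows downside, sketched here under the assumption that Windows servers can simply do without the option, is to set SO_REUSEADDR only on non-Windows platforms. This differs from the bzrlib patch earlier in the thread, which sets the option wherever the constant exists. The function name `open_server_socket` is hypothetical.

```python
import os
import socket


def open_server_socket(host: str, port: int) -> socket.socket:
    """Bind a TCP listening socket, using SO_REUSEADDR only off Windows.

    On POSIX stacks SO_REUSEADDR lets a restarted server bind while old
    connections on the port linger; on Windows it would also let another
    process bind the same port, so it is skipped there.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if os.name not in ("nt", "cygwin") and hasattr(socket, "SO_REUSEADDR"):
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind((host, port))
        sock.listen(5)
    except OSError:
        sock.close()
        raise
    return sock
```

Since Python 3.8 the standard library's `socket.create_server()` applies the same platform rule, so `socket.create_server(("127.0.0.1", 0))` gives an equivalent listener without the hand-written check.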

Jelmer Vernooij (jelmer)
Changed in bzr:
status: Fix Committed → Fix Released