interrupting a client waiting for a lock over bzr+ssh leaves process on server

Bug #172392 reported by Michael Hudson-Doyle
4
Affects Status Importance Assigned to Milestone
Bazaar
Fix Released
High
Martin Pool
Launchpad itself
Fix Released
High
Jonathan Lange

Bug Description

This is a bit confusing. Here's how to reproduce:

1) lock a branch (i used "python2.4 -c 'from bzrlib import branch; branch.Branch.open("/tmp/whatever").lock_write(); import time; time.sleep(100)' &" for this)
2) attempt to push to the branch over bzr+ssh ("bzr push bzr+ssh://localhost/tmp/whatever"). Unsurprisingly, it waits for a lock. C-c it to exit it.
3) break the lock (with "bzr break-lock /tmp/whatever", no bzr+ssh here)
4) attempt to push again. It still waits for a lock. Wtf??

I think what is happening here is that the 'bzr serve' process created in step 2 is still hanging around waiting for the lock that was created in step 1 to be released. Step 3 breaks this lock, allowing this 'bzr serve' process to grab the lock and cause confusion in step 4.

If I remember discussion correctly from somewhere, bzr probably isn't closing the ssh channel properly or something. Anyway, this bug creates very confusing situations, so it should be fixed sooner rather than later IMHO...

Tags: hpss lp-code
Revision history for this message
John A Meinel (jameinel) wrote :

I'm guessing the fix might be for the bzr serve process to notice when the client has exited after taking the lock, and go ahead and unlock itself.
I know we are trying hard for the smart protocol to be stateless, though. Which makes it difficult to have any sort of chat between the client and server, to let it know that the client is still around.

Another issue is having the bzr server say to the client "I'm waiting on the lock, I'll get back to you in a bit." I'm not sure this can be done if we expect each RPC to have a single Request => Response. Rather than a Request => Ongoing Response. Though if HTTP supports "Ongoing Response" then we could probably make use of it.

The *best* solution is to get more smarts in the protocol, so that we never actually have to request a lock be taken. If all actions are streamed and self-contained then the bzr process on the server will lock and unlock on its own, and not care if the client disconnects.

Changed in bzr:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Andrew Bennetts (spiv) wrote :

I *think* Michael's hypothesis is quite likely right observed behaviour; the default lock poll interval is 1 second, so unless you had a supernaturally fast SSH connection setup in step 4, you'd never beat the "bzr serve" process from step 2 to the lock.

Revision history for this message
Jonathan Lange (jml) wrote :

I'm not sure how this is different from https://bugs.edge.launchpad.net/bzr/+bug/141172

Revision history for this message
John A Meinel (jameinel) wrote :

Even if the client disconnects properly the server process will likely be left running on the other side. As it doesn't have any idea that the client has disconnected. It will continue trying to grab the lock. Only once the lock is obtained would it respond to the client and find out that it had disconnected.

So bug 141172 may have been prompted because of this bug, but fixing it will not fix this bug.

Revision history for this message
Jonathan Lange (jml) wrote :

Andrew has said that he will look at this on Monday. We'd really like this fixed for the 1.0 / 1.12 release.

Changed in bzr:
assignee: nobody → spiv
importance: Medium → High
Revision history for this message
Andrew Bennetts (spiv) wrote :

The steps to reproduce work.

The workaround is to repeatedly do "bzr break-lock" until it no longer prompts the user to break a lock. At that point you can assume any backlog of stale "bzr serve" processes that are still waiting on the lock will have all acquired the lock (and had their locks broken).

Revision history for this message
Martin Pool (mbp) wrote :

As an overview of all these bugs:

The client terminates. The ssh client child process is left behind, as is the server process. The server keeps trying to take the lock, and eventually succeeds.

> Even if the client disconnects properly the server process will likely be left running on the other side. As it doesn't have any idea that the client has disconnected.

jml says that this will not happen on Launchpad, where the ssh server will notice that the client has disconnected, and send a sighup to the bzr server process. However, that may not happen with other ssh servers.

To check this, I ran 'ssh localhost sleep 1h', and kill-9'd the client. This left the sleep process running and as far as I can tell with strace, it did not get a sighup. So it looks like just killing the client is not enough to be sure the server will shut down.

One approach would be for the server to just try the lock once, rather than blocking on it, and then come back to the client to say if it was busy. This fixes a few things: the client gets told that it's waiting and can potentially show progress, and the server should rarely take a lock when there's no longer a client wanting it.

Revision history for this message
Martin Pool (mbp) wrote :

taking this from andrew, working on it with jml

Changed in bzr:
assignee: spiv → mbp
Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 172392] Re: interrupting a client waiting for a lock over bzr+ssh leaves process on server

On Mon, 2008-03-17 at 01:37 +0000, Martin Pool wrote:
> As an overview of all these bugs:
...
> To check this, I ran 'ssh localhost sleep 1h', and kill-9'd the client.
> This left the sleep process running and as far as I can tell with
> strace, it did not get a sighup. So it looks like just killing the
> client is not enough to be sure the server will shut down.

what state was the socket left in?

> One approach would be for the server to just try the lock once, rather
> than blocking on it, and then come back to the client to say if it was
> busy. This fixes a few things: the client gets told that it's waiting
> and can potentially show progress, and the server should rarely take a
> lock when there's no longer a client wanting it.

It will reduce the window, but it won't remove it entirely (networks for
the win). On the other hand, it will cause more ssh server branch-open
operations at the far end, and will likely skew lock obtaining to folk
with faster links.

-Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.

Revision history for this message
Martin Pool (mbp) wrote : Re: [Bug 172392] Re: interrupting a client waiting for a lock over bzr+ssh leaves process on server

> what state was the socket left in?

TIME_WAIT

--
Martin <http://launchpad.net/~mbp/>

Revision history for this message
Martin Pool (mbp) wrote :

bug 40508 and bug 5987 are the history for why ssh is handled this way.

When bzr gets a ^c, we don't want to have that signal sent to the ssh subprocess too, because bzr may want to do some cleanup before it exits, such as releasing a lock (assuming the connection is still working at all). We don't want to revert that change.

You could argue that we should always just immediately stop, without attempting any cleanup, though this will cause some leftover temporary files, broken locks, etc. But they should be recoverable. Doing cleanup across the network may anyhow be slow and/or unreliable.

We could do one of these:

1- keep a list of ssh subprocesses, and when exiting the bzr process (atexit) send sighup to all of them
2- when closing the remote medium, wait a short time for ssh to exit, then send sighup to it

Revision history for this message
Martin Pool (mbp) wrote :

Andrew did (on 2007-12-19, bzr.dev r3118.3.1) already merge a change that in cmd_serve reduces the timeout to zero. So, I'm surprised that we're still seeing the server lingering and attempting to take the locks.

Revision history for this message
Martin Pool (mbp) wrote :

I believe the bzr lpserve code can avoid this by setting
  bzrlib.lockdir._DEFAULT_TIMEOUT_SECONDS = 0

which is what is done in cmd_serve. You will probably want an integration test for it.

Changed in launchpad-bazaar:
assignee: nobody → jml
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Martin Pool (mbp) wrote :

After some investigation: I think this is already fixed in bzr itself, because we disable waiting for locks at a high level when starting the server. Because Launchpad doesn't run "bzr serve" this is not active. I suggest they should do the same thing.

On the whole, since the server process never wants to wait for locks regardless of where they are taken, setting global state to indicate this seems reasonable.

Changed in bzr:
status: Triaged → Fix Released
Jonathan Lange (jml)
Changed in launchpad-bazaar:
status: Confirmed → Fix Released
Revision history for this message
lszyba1 (szybalski) wrote :

Using saved location: bzr+ssh://<email address hidden>/~username/myproject/trunk/
Unable to obtain lock lp-46075600:///~username/myproject/trunk/.bzr/branch/lock
held by <email address hidden> on host crowberry [process #9668]
locked 9 minutes, 23 seconds ago
Will continue to try until 02:10:45, unless you press Ctrl-C
If you're sure that it's not being modified, use bzr break-lock lp-46075600:///~username/myproject/trunk/.bzr/branch/lock
bzr: ERROR: Could not acquire lock "(remote lock)"

How can I unlock this?
Lucas

Revision history for this message
Tim Penhey (thumper) wrote :

bzr break-lock lp:~username/myproject/trunk should do it.

Also please don't add unrelated comments to closed bugs or they are likely to get missed. A good place for this is to ask a question: https://launchpad.net/launchpad/+addquestion.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.