bzr branch from launchpad disconnected after exactly one hour

Bug #537116 reported by Andrew Bennetts
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Launchpad itself
Triaged
High
Unassigned

Bug Description

[lightly edited from original question]

I've been trying to branch web2py from launchpad -- I've tried with a previous bzr version and now with 2.1.0. It starts off fine, but at the one hour mark [exact to the minute], it stops doing anything. [Andrew says: This is suspiciously close to the idle_timeout in Launchpad's codehosting.]

Here's where it had gotten to when it stopped:

$ bzr branch -Dhpss -Dhpssdetail lp:~mdipierro/web2py/devel web2py
- 106420KB 29KB/s | Fetching revisions:Inserting stream

(The rate is steady at about 30KB/s up to the point where it stops.)

The bzr branch command waits for some while after activity stops, then exits with this message:

FATAL ERROR: Server unexpectedly closed network connection
bzr: ERROR: Connection closed: Unexpected end of message. Please check connectivity and permissions, and report a bug if problems persist.
HPSS calls: 7 (0 vfs) SmartSSHClientMedium(bzr+ssh://<email address hidden>/)

I have no trouble branching sahanapy, where the branch completes in under an hour.

I did bzr lp-login, so this isn't the same as question #84616. Can't tell if question #81139 is related -- I'm not using the same ssh, and the server in my case is launchpad.

I'm wondering if either bzr has a one hour timeout (I'm expecting it doesn't) or if launchpad has a process killer that nukes some types of processes after an hour (e.g. those servicing ports), on the expectation that if they hang around for an hour, they must be dead (seems more likely than a timeout in bzr).

If it is launchpad that's doing it, and if the launchpad admins don't want to up the timeout, does bzr have a checkpoint and restart mechanism for branches / checkouts / other potentially long commands?

Client is running on WinXP, in a cygwin bash terminal, with putty (pageant) ssh. Version of bzr is 2.1.0. Network connection is DSL, not fast but seems quite reliable -- I haven't seen it drop the connection (except once when my phone went dead as well -- the external phone line was out).

Below is the log, with the long middle section of "NNN byte part read" lines omitted. (In case it's relevant, the values of NNN stayed at around 400-600 for a while, then started creeping up, then started having bursts with numbers reaching just over 2000000.)

Thanks!

-- Pat

Wed 2010-03-10 14:05:20 -0800
0.140 bazaar version: 2.1.0
0.140 bzr arguments: [u'branch', u'-Dhpss', u'-Dhpssdetail', u'lp:~mdipierro/web2py/devel', u'web2py']
0.156 looking for plugins in C:/Documents and Settings/Patricia Tressel/Application Data/bazaar/2.0/plugins
0.156 looking for plugins in C:/Program Files/Bazaar/plugins
0.343 encoding stdout as sys.stdout encoding 'cp437'
2.671 hpss: Built a new medium: SmartSSHClientMedium
2.765 bzr-svn: using Subversion 1.6.6 ()
2.859 hpss call: 'BzrDir.open_2.1', '~mdipierro/web2py/devel/'
2.859 (to bzr+ssh://bazaar.launchpad.net/~mdipierro/web2py/devel/)
8.437 result: ('yes', 'no')
8.437 hpss call: 'BzrDir.open_branchV3', '~mdipierro/web2py/devel/'
8.437 (to bzr+ssh://bazaar.launchpad.net/~mdipierro/web2py/devel/)
8.656 result: ('branch', 'Bazaar Branch Format 6 (bzr 0.15)\n')
8.671 hpss call: 'BzrDir.find_repositoryV3', '~mdipierro/web2py/devel/'
8.671 (to bzr+ssh://bazaar.launchpad.net/~mdipierro/web2py/devel/)
8.875 result: ('ok', '', 'no', 'no', 'no', 'Bazaar pack repository format 1 (needs bzr 0.92)\n')
8.890 hpss call: 'Branch.get_stacked_on_url', '~mdipierro/web2py/devel/'
8.890 (to bzr+ssh://bazaar.launchpad.net/~mdipierro/web2py/devel/)
9.343 result: ('UnstackableBranchFormat', 'Branch format 6', 'lp-66552656:///~mdipierro/web2py/devel')
9.343 hpss call: 'Branch.last_revision_info', '~mdipierro/web2py/devel/'
9.343 (to bzr+ssh://bazaar.launchpad.net/~mdipierro/web2py/devel/)
9.796 result: ('ok', '1738', '<email address hidden>')
9.812 hpss call: 'BzrDir.cloning_metadir', '~mdipierro/web2py/devel/', 'False'
9.812 (to bzr+ssh://bazaar.launchpad.net/~mdipierro/web2py/devel/)
10.312 result: ('Bazaar-NG meta directory, format 1\n', 'Bazaar pack repository format 1 (needs bzr 0.92)\n', ('branch', 'Bazaar Branch Format 6 (bzr 0.15)\n'))
10.375 creating repository in file:///C:/Documents%20and%20Settings/Patricia%20Tressel/My%20Documents/sahanapy/web2py/.bzr/.
10.453 Using fetch logic to copy between RemoteRepository(bzr+ssh://bazaar.launchpad.net/~mdipierro/web2py/devel/.bzr/)(<RemoteRepositoryFormat>) and KnitPackRepository('file:///C:/Documents%20and%20Settings/Patricia%20Tressel/My%20Documents/sahanapy/web2py/.bzr/repository/')(<RepositoryFormatKnitPack1>)
10.453 hpss call w/body: 'Repository.get_stream_1.19', '~mdipierro/web2py/devel/', 'Bazaar pack repository format 1 (needs bzr 0.92)\n' ('ancestry-of\nmdipierr'...)
10.453 86 bytes
12.156 result: ('ok',)
12.156 42 byte part read
12.156 54 byte part read
12.156 548 byte part read
12.171 574 byte part read
...
3516.000 431 byte part read
3516.000 417 byte part read
3526.343 322989 byte part read
3586.671 1862204 byte part read
3589.156 61915 byte part read
3589.156 434 byte part read
3589.171 424 byte part read
3603.171 391488 byte part read
7204.296 decoder state: buf[:10]='\x00\x1e\x8a\x01B20013', state_accept=_state_accept_expecting_bytes
7204.343 Traceback (most recent call last):
  File "bzrlib\commands.pyo", line 853, in exception_to_return_code
  File "bzrlib\commands.pyo", line 1055, in run_bzr
  File "bzrlib\commands.pyo", line 661, in run_argv_aliases
  File "bzrlib\commands.pyo", line 665, in run_direct
  File "bzrlib\cleanup.pyo", line 122, in run_simple
  File "bzrlib\cleanup.pyo", line 156, in _do_with_cleanups
  File "bzrlib\builtins.pyo", line 1232, in run
  File "bzrlib\bzrdir.pyo", line 1184, in sprout
  File "bzrlib\repository.pyo", line 1704, in fetch
  File "bzrlib\decorators.pyo", line 194, in write_locked
  File "bzrlib\repository.pyo", line 3432, in fetch
  File "bzrlib\fetch.pyo", line 83, in __init__
  File "bzrlib\fetch.pyo", line 109, in __fetch
  File "bzrlib\fetch.pyo", line 137, in _fetch_everything_for_search
  File "bzrlib\repository.pyo", line 4274, in insert_stream
  File "bzrlib\repository.pyo", line 4307, in _locked_insert_stream
  File "bzrlib\knit.pyo", line 1599, in insert_record_stream
  File "bzrlib\versionedfile.pyo", line 1665, in read
  File "bzrlib\smart\repository.pyo", line 578, in iter_substream_bytes
  File "bzrlib\smart\repository.pyo", line 568, in iter_stream_decoder
  File "bzrlib\smart\message.pyo", line 336, in read_streamed_body
  File "bzrlib\smart\message.pyo", line 286, in _read_more
ConnectionReset: Connection closed: Unexpected end of message. Please check connectivity and permissions, and report a bug if problems persist.

7204.343 Transferred: 106455KiB (14.8K/s r:106454K w:1K)
7204.343 return code 3
[ 4476] 2010-03-10 16:05:24.342 INFO: HPSS calls: 7 (0 vfs) SmartSSHClientMedium(bzr+ssh://<email address hidden>/)

Tags: lp-code
Andrew Bennetts (spiv)
affects: bzr → launchpad-code
Revision history for this message
Pat Tressel (ptressel) wrote :

Andrew -- thanks for posting this!

Paul Hummer (rockstar)
Changed in launchpad-code:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

It's pretty likely that you're being cut off by the idle timeout, but not at all clear why the connection was idle for an hour. That really shouldn't happen. You say this is repeatable?

Are you branching into a shared repo? I could just about imagine that a cross-format fetch might stall like this.

Revision history for this message
Tim Penhey (thumper) wrote : Re: [Bug 537116] Re: bzr branch from launchpad disconnected after exactly one hour

On Mon, 15 Mar 2010 17:00:10 Michael Hudson wrote:
> It's pretty likely that you're being cut off by the idle timeout, but
> not at all clear why the connection was idle for an hour. That really
> shouldn't happen. You say this is repeatable?
>
> Are you branching into a shared repo? I could just about imagine that a
> cross-format fetch might stall like this.

Also the first branch into an empty shared repo is extremely slow for a large
tree.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Tim Penhey wrote:
> On Mon, 15 Mar 2010 17:00:10 Michael Hudson wrote:
>> It's pretty likely that you're being cut off by the idle timeout, but
>> not at all clear why the connection was idle for an hour. That really
>> shouldn't happen. You say this is repeatable?
>>
>> Are you branching into a shared repo? I could just about imagine that a
>> cross-format fetch might stall like this.
>
> Also the first branch into an empty shared repo is extremely slow for a large
> tree.

Yes, but the connection isn't idle all that time -- it's messages going
back and forth to walk over the revision graph that causes the slowness
AIUI.

Cheers,
mwh

Revision history for this message
Pat Tressel (ptressel) wrote : Re: [Bug 537116] Re: bzr branch from launchpad disconnected after exactly one hour

Michael, Tim --

> It's pretty likely that you're being cut off by the idle timeout, but
> not at all clear why the connection was idle for an hour.  That really
> shouldn't happen.  You say this is repeatable?

The connection wasn't idle -- the progress line was updating steadily
right up to the hour mark, then stopped.

I tried it 3-4 times, IIRC, and all hung at an hour. Also (as I now
know from trying the suggested workaround of splitting the branch up
into an initial revision-limited branch followed by pulls with
progressively higher revision limits) that a pull time out if the
revisions it's inserting are big enough. That happened twice.

> Are you branching into a shared repo?  I could just about imagine
> that a cross-format fetch might stall like this.

I didn't use a shared repo -- I only expect to need one web2py branch.

-- Pat

Revision history for this message
Pat Tressel (ptressel) wrote :

Wondering if there's any progress on the one hour timeout issue? I just tried bzr branch with web2py again, and it hung and then timed out an hour later, just as in this report.

Note this branch is into an empty directory. Here's where it stopped:
$ bzr branch --use-existing-dir lp:~mdipierro/web2py/devel web2py
- 104344KB 30KB/s | Fetching revisions:Inserting stream

And an hour later:
FATAL ERROR: Server unexpectedly closed network connection
bzr: ERROR: Connection closed: Unexpected end of message. Please check connectiv
ity and permissions, and report a bug if problems persist.

IIRC the conclusion before was that there was a very large revision, and a lot of behind the scenes activity, but not what launchpad monitors to determine if the connection is idle. In the original case, the existence of a large revision was confirmed, and the same hang and timeout was reproduced on a single revision, by doing a bzr branch to rev 1, then bzr pull to extract individual revsions, one at a time.

My network connection is DSL but not a really expensive high-speed line. The last time I needed to branch web2py, the "workaround" was to find someone with a very high-speed connection who did the branch, tarred up the result and sent it to me.

Proposed solution was to have lp monitor both ports (or however the activity is partitioned). Proposed temporary workaround was to increase the "idle" timeout.

Revision history for this message
Martin Pool (mbp) wrote :

One contributing factor here is that the branch lp:~mdipierro/web2py/devel is in an old bzr format and it will be faster for all involved if the owner upgrades it.

To summarize, there are three phases:

0-60m bzr slowly pulls stuff
60-120m bzr apparently stuck doing nothing
120m server disconnects the client

It's not surprising that the client's disconnected after an hour of inactivity; in fact that's a feature. The real problem is that we hang for an hour before that. It would be highly useful to get a traceback during that phase.

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 537116] Re: bzr branch from launchpad disconnected after exactly one hour

Its possibly the 'convert revs on server' if the target branch is a 2a format.

Revision history for this message
Martin Pool (mbp) wrote :

fwiw on my home ADSL connection, I can branch from there in about 18m, pulling down 600MB of revision history, without apparently hanging at any point. I could try slowing it down and see what happens after an hour.

Revision history for this message
Martin Pool (mbp) wrote :

Pat, just to confirm there's no destination repository, can you please
run 'bzr info' in the directory where you're trying to make your new
branch.

Revision history for this message
Pat Tressel (ptressel) wrote :

On Mon, Sep 27, 2010 at 8:56 PM, Martin Pool <email address hidden> wrote:

> One contributing factor here is that the branch
> lp:~mdipierro/web2py/devel is in an old bzr format and it will be faster
> for all involved if the owner upgrades it.
>
> To summarize, there are three phases:
>
> 0-60m bzr slowly pulls stuff
> 60-120m bzr apparently stuck doing nothing
> 120m server disconnects the client
>
> It's not surprising that the client's disconnected after an hour of
> inactivity; in fact that's a feature.
>

> The real problem is that we hang
> for an hour before that. It would be highly useful to get a traceback
> during that phase.
>

That's not what I understood from the investigation that has already been
done, but I could be wrong... It sounds as though there is *no* period of
inactivity. There are *not* two separate one hour intervals involved.
Rather, the bzr activity that is going on is not regarded as activity by
launchpad. During the one hour of what launchpad regards as inactivity, bzr
is busy doing something relating to a single enormous revision, which is
possibly a collection of a large number of merged revisions.

So rather than three phases:

0 to N hours: bzr is visibly making progress.
N to N+1 hours: bzr stops visibly making progress.
N+1 hour: bzr exits with error.

-- Pat

Revision history for this message
Pat Tressel (ptressel) wrote :

On Mon, Sep 27, 2010 at 9:29 PM, Martin Pool <email address hidden> wrote:

> Pat, just to confirm there's no destination repository, can you please
> run 'bzr info' in the directory where you're trying to make your new
> branch.
>

Too late -- I needed that exact directory to be functional (it's hard-linked
from elsewhere so can't change it). So it's already had a bzr checkout done
into it.

-- Pat

Revision history for this message
Pat Tressel (ptressel) wrote :

On Mon, Sep 27, 2010 at 9:28 PM, Martin Pool <email address hidden> wrote:

> fwiw on my home ADSL connection, I can branch from there in about 18m,
> pulling down 600MB of revision history, without apparently hanging at
> any point. I could try slowing it down and see what happens after an
> hour.
>

Then your connection is considerably faster than mine.

bzr was reporting rates of 30KB/sec on mine.

-- Pat

Revision history for this message
Martin Pool (mbp) wrote :

@#11 I think that's the timeline I was describing too. I don't think it's ever more than 1 hour of actual useful work, is it? Then there's one hour stuck, then that phase is terminating by the whole program erroring.

I just tested this with an additional 1000ms latency on eth0, and it completed in 43m, probably not enough to provoke whatever's happening at the 1h mark. I would have thought that if there was something CPU-bound that caused a long hang, it would show up in this. This branching used only 36s cpu time locally, so it seems unlikely that could blow out to 1h of work unless your machine is much slower. It may be you have some kind of environment where things do become much worse.

Do you now have a working branch or are you still blocked?

Revision history for this message
Pat Tressel (ptressel) wrote :

> @#11 I think that's the timeline I was describing too. I don't think
> it's ever more than 1 hour of actual useful work, is it?

Considerably more than 1 hour. IIRC it ran for about 5-6 hours before it
quit updating. And we did track it down to a single huge revision. During
the previous investigation, I pulled one revision at a time. I don't have
the revision #.

> Then there's one hour stuck,

Or one hour spent fussing with that one huge revision. I believe that this
is also documented -- these is a description of what it's doing during that
time.

> then that phase is terminating by the whole program erroring.
>

And then it gets disconnected at the launchpad end. The local bzr just
reports that.

-- Pat

Revision history for this message
Andrew Bennetts (spiv) wrote :

This is likely to have been caused by bug 556132, which is now fixed. Searching google for "ssh rekey hour" finds plenty of FAQs about various SSH clients (e.g. SSH Corp's client) that automatically initiate a rekey after exactly one hour by default, perhaps current OpenSSH does too, and that would have provoked this problem. So perhaps we should close this bug.

Revision history for this message
Martin Pool (mbp) wrote :

I agree this is very likely a dupe. It has been fixed on Launchpad and does not need an upgrade to your bzr client. If it happens again, please let us know.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.