MemoryError/OverflowError in _read_info_file during branch unlock when pushing over SFTP

Bug #255292 reported by domachine
This bug affects 1 person
Affects           Status    Importance  Assigned to  Milestone
Bazaar            Invalid   High        Unassigned
Launchpad itself  Invalid   High        Unassigned

Bug Description

bzr: ERROR: exceptions.OverflowError: long int too large to convert to int

Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/bzrlib/commands.py", line 834, in run_bzr_catch_errors
    return run_bzr(argv)
  File "/usr/lib/python2.5/site-packages/bzrlib/commands.py", line 790, in run_bzr
    ret = run(*run_argv)
  File "/usr/lib/python2.5/site-packages/bzrlib/commands.py", line 492, in run_argv_aliases
    return self.run(**all_cmd_args)
  File "/usr/lib/python2.5/site-packages/bzrlib/builtins.py", line 840, in run
    stop_revision=revision_id)
  File "/usr/lib/python2.5/site-packages/bzrlib/decorators.py", line 127, in read_locked
    return unbound(self, *args, **kwargs)
  File "/usr/lib/python2.5/site-packages/bzrlib/branch.py", line 1559, in push
    target.unlock()
  File "/usr/lib/python2.5/site-packages/bzrlib/branch.py", line 1344, in unlock
    self.control_files.unlock()
  File "/usr/lib/python2.5/site-packages/bzrlib/lockable_files.py", line 270, in unlock
    self._lock.unlock()
  File "/usr/lib/python2.5/site-packages/bzrlib/lockdir.py", line 307, in unlock
    self.confirm()
  File "/usr/lib/python2.5/site-packages/bzrlib/lockdir.py", line 395, in confirm
    info = self.peek()
  File "/usr/lib/python2.5/site-packages/bzrlib/lockdir.py", line 418, in peek
    info = self._read_info_file(self._held_info_path)
  File "/usr/lib/python2.5/site-packages/bzrlib/lockdir.py", line 408, in _read_info_file
    return self._parse_info(self.transport.get(path))
  File "/usr/lib/python2.5/site-packages/bzrlib/transport/sftp.py", line 238, in get
    f = self._get_sftp().file(path, mode='rb')
  File "/var/lib/python-support/python2.5/paramiko/sftp_client.py", line 221, in open
    t, msg = self._request(CMD_OPEN, filename, imode, attrblock)
  File "/var/lib/python-support/python2.5/paramiko/sftp_client.py", line 572, in _request
    return self._read_response(num)
  File "/var/lib/python-support/python2.5/paramiko/sftp_client.py", line 602, in _read_response
    t, data = self._read_packet()
  File "/var/lib/python-support/python2.5/paramiko/sftp.py", line 180, in _read_packet
    data = self._read_all(size)
  File "/var/lib/python-support/python2.5/paramiko/sftp.py", line 163, in _read_all
    x = self.sock.recv(n)
  File "/usr/lib/python2.5/site-packages/bzrlib/transport/ssh.py", line 612, in recv
    return os.read(self.proc.stdout.fileno(), count)
OverflowError: long int too large to convert to int

bzr 1.3.1 on python 2.5.2.final.0 (linux2)
arguments: ['/usr/bin/bzr', 'push']
encoding: 'UTF-8', fsenc: 'UTF-8', lang: 'de_DE.UTF-8'
plugins:
  launchpad /usr/lib/python2.5/site-packages/bzrlib/plugins/launchpad [unknown]

Tags: lp-code
Revision history for this message
John A Meinel (jameinel) wrote :

Interesting, as this seems to indicate a single "recv" of more than 2GB (2^31).

I'm very surprised that this is failing as part of "recv", since that would indicate a pull, not a push.

I assume "get-you" is a project? Could you give any more details?

Changed in bzr:
status: New → Incomplete
Revision history for this message
Andrew Bennetts (spiv) wrote :

John, recv's happen during push too. For example in this traceback it is happening while reading a lockdir's info file.

Looking closely at the traceback, the large size actually comes from this bit of paramiko:

    def _read_packet(self):
        size = struct.unpack('>I', self._read_all(4))[0]
        data = self._read_all(size)

So the SSH server is sending a ridiculously huge packet. Possibly this is because the info file we're retrieving over SFTP is actually that massive, or maybe this is just a bug in that SSH server?

Either way, my guess is that the root problem here isn't bzr's fault, although obviously bzr should be a bit more graceful in how it handles this situation. That's just a guess, though.
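The framing quoted above is the crux: paramiko trusts the 4-byte length prefix unconditionally and tries to read (and allocate) whatever size it claims. A minimal sketch of the "more graceful" handling suggested here; the MAX_PACKET constant and read_packet helper are hypothetical, not paramiko's actual API (Python 3 shown, though the report predates it):

```python
import struct

# Hypothetical sanity limit; paramiko has no such constant.
MAX_PACKET = 256 * 1024

def read_packet(read_all):
    """Read one length-prefixed packet via a read_all(n) callable."""
    size = struct.unpack('>I', read_all(4))[0]
    if size > MAX_PACKET:
        # Fail with a clear error instead of trying to allocate
        # gigabytes (the MemoryError/OverflowError seen in this bug).
        raise ValueError('implausible packet length: %d' % size)
    return read_all(size)
```

Fed a well-formed frame this returns the payload; fed garbage where a length prefix is expected, it raises a descriptive error rather than a bare MemoryError.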

cash5, can you tell us what OS and filesystem and sshd is running on the server? Are there any "info" files in the subdirectories of the .bzr/branch/lock directory on the server, and if so what size are they?

Revision history for this message
John A Meinel (jameinel) wrote :

Sure, I realize it is bidirectional communication. ...

I just thought of something, though. Could this be when triggering an auto-pack?

Even if it was, I would still be surprised for something like this to be transmitting >2GB in a single packet.

Also, I didn't think sending packets that large was even allowed for SFTP (perhaps this is just raw SSH?). At least, what I remember of the SFTP spec is that you couldn't guarantee a readv of more than 32K (though OpenSSH allowed 64K readv requests).

I'm somewhat tempted to say bogus sftp server.

Also, looking closer at the traceback, you can see that we are reading the lock/held/info file. I think what is happening is that we are peeking at a file that doesn't exist, and the SFTP server is sending back garbage rather than an "ENOENT".

 File "/usr/lib/python2.5/site-packages/bzrlib/lockdir.py", line 395, in confirm
    info = self.peek()
  File "/usr/lib/python2.5/site-packages/bzrlib/lockdir.py", line 418, in peek
    info = self._read_info_file(self._held_info_path)
  File "/usr/lib/python2.5/site-packages/bzrlib/lockdir.py", line 408, in _read_info_file
    return self._parse_info(self.transport.get(path))

At a minimum, the 'info' file should be on the order of *maybe* 10kB. Certainly nowhere near 2GB in size.

Revision history for this message
Andrew Bennetts (spiv) wrote :

I see "get-you" is hosted on Launchpad, so I guess this is likely to be due to the bazaar.launchpad.net SFTP server. I'll mark this bug as also affecting launchpad-bazaar; I suspect a Conch bug.

Revision history for this message
Andrew Bennetts (spiv) wrote :

FWIW, I looked on the server and I don't see any lockdirs for the get-you hosted branches at the moment. So if there was something funny about the info file there isn't anymore.

Revision history for this message
domachine (dominik-burgdoerfer) wrote :

Yeah, it's a little bit strange: sometimes it works, and then other times it doesn't.
All I do when the problem occurs is break the lock with `bzr break-lock sftp://dominik.burgdoerfer@...' and then the next push works.

Revision history for this message
domachine (dominik-burgdoerfer) wrote :

Oh, I forgot to say that the error also occurs when my workmate tries to push a new revision from his computer.

Revision history for this message
Andrew Bennetts (spiv) wrote :

cash5: that's interesting, because break-lock should be fetching the exact same file! I guess that means it's an intermittent problem. Hmm. Thanks for the information!

Revision history for this message
Andrew Bennetts (spiv) wrote :

Separately I recently discovered that OpenSSH's SFTP client will tend to give error messages like "received message too long 1397966893" when the data read isn't actually from an SFTP server.

This makes me wonder if something other than SFTP data is getting written to that channel (or whatever the SSH protocol jargon is). Non-SFTP data would tend to trigger this sort of error (i.e. packets with impossibly huge length prefixes).

It would help to know what the offending int actually is. In the "received message too long 1397966893" error I saw elsewhere, it turns out that 1397966893 is what you get if you treat the ascii string "SSH-" as a 4-byte int. If we knew what ints were causing the MemoryError/OverflowErrors for this bug, it might be a very helpful clue.
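That arithmetic is easy to verify (Python 3 shown, though the bug report predates it): interpreting the ASCII banner prefix "SSH-" as a big-endian 4-byte unsigned int gives exactly the number in that OpenSSH error message.

```python
import struct

# The four ASCII bytes "SSH-" read as a big-endian unsigned int:
size = struct.unpack('>I', b'SSH-')[0]
print(size)  # 1397966893
```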

Revision history for this message
domachine (dominik-burgdoerfer) wrote : Re: [Bug 255292] Re: MemoryError/OverflowError in _read_info_file during branch unlock when pushing over SFTP

On Wednesday 13 August 2008 01:39:32 you wrote:
> Separately I recently discovered that OpenSSH's SFTP client will tend to
> give error messages like "received message too long 1397966893" when the
> data read isn't actually from an SFTP server.
>
> This makes me wonder if something other than SFTP data is getting
> written to that channel (or whatever the SSH protocol jargon is). Non-
> SFTP data would tend to trigger this sort of error (i.e. packets with
> impossibly huge length prefixes).
>
> It would help to know what the offending int actually is. In the
> "received message too long 1397966893" error I saw elsewhere, it turns
> out that 1397966893 is what you get if you treat the ascii string "SSH-"
> as a 4-byte int. If we knew what ints were causing the
> MemoryError/OverflowErrors for this bug, it might be a very helpful
> clue.

Ah, sorry, I don't really understand. Do you want the files I uploaded to the server? Or do you want their size? Sorry about my weak English,
but I'm from Germany. I'm looking forward to helping you.

Dominik

Revision history for this message
Jonathan Lange (jml) wrote :

Andrew seems to have this in hand. Updating the status and assigning to him.

Changed in launchpad-bazaar:
assignee: nobody → spiv
status: New → Incomplete
Revision history for this message
Andrew Bennetts (spiv) wrote :

I replied to cash5's comment (comment #10) via private email, as that's where I initially received it. For the benefit of other people reading this bug, here's my reply:

"""
Neither. I want to know what the bad SFTP packet sent over the network looks
like.

If you are comfortable with debugging Python, then you could perhaps set
BZR_PDB=1 in your environment to get a pdb prompt when this error happens, and
then see if you can find out what the last bytes read by paramiko were.

Or perhaps just edit your install of bzr to display more information when this
happens. E.g. apply a patch like:

=== modified file 'bzrlib/transport/ssh.py'
--- bzrlib/transport/ssh.py 2008-04-24 07:22:53 +0000
+++ bzrlib/transport/ssh.py 2008-08-14 06:20:41 +0000
@@ -602,7 +602,11 @@
         return os.write(self.proc.stdin.fileno(), data)

     def recv(self, count):
-        return os.read(self.proc.stdout.fileno(), count)
+        try:
+            return os.read(self.proc.stdout.fileno(), count)
+        except:
+            print '** Error occurred reading %r bytes' % (count,)
+            raise

     def close(self):
         self.proc.stdin.close()

Thanks for any help!
"""

Revision history for this message
Jonathan Lange (jml) wrote :

I'm getting this bug with info over sftp.

I didn't quite apply your patch, Andrew; I just put a "print count" at the top of recv. Here are the last few numbers:

32777
4
32777
4
32777
4
32777
4
32777
4
1714499681
1714466917

Boom!

Revision history for this message
John A Meinel (jameinel) wrote :

Well, something is seriously wrong if we are asking for 1,714,499,681 bytes from _read_info_file.

However, this looks a lot more like we are reading the indexes, etc., which is valid. I'm still a bit surprised to see a 1.7GB request. Is that at all valid for the size of project you are working with?

Revision history for this message
Jonathan Lange (jml) wrote :

On Mon, Sep 22, 2008 at 11:53 PM, John A Meinel <email address hidden> wrote:
> Well, something is seriously wrong if we are asking for 1,714,499,681
> bytes from _read_info_file.
>
> However, this looks a lot more like we are reading the indexes, etc.
> Which is valid. I'm still a bit surprised to see a 1.7GB request. Is
> that at all valid for the size of project you are working with?
>

No. The repository is ~600MB and the branch itself is ~2MB.

Revision history for this message
Andrew Bennetts (spiv) wrote :

John, my suspicion is that bzr isn't actually doing anything wrong (despite where this bug is filed), but instead that the SFTP conversation between paramiko and bazaar.launchpad.net (which is basically Twisted's Conch server) is going awry. i.e. the traceback doesn't seem to be saying that bzr is asking for 1,714,499,681 bytes, but instead that the SFTP client has received an SSH or SFTP packet with a length-prefix claiming to be that huge, apparently in response to a file open request.

jml, can you explain how you were able to reproduce this? I'm intrigued by your mention of "info over sftp"; the traceback in the original report is triggered during a branch unlock, which reads the .bzr/branch/lock/held/info file (to find out if the lock was broken while this client thought it held it). But 'bzr info' shouldn't acquire a write lock. If your tracebacks vary substantially from the original report, it's probably worth attaching them to this report.

FWIW:

>>> struct.pack('>I', 1714499681)
'f10a'
>>> struct.pack('>I', 1714466917)
'f0\xb0e'

Which, despite being mostly ASCII bytes, doesn't look like anything I recognise. It would be good to somehow get a complete capture of the SFTP conversation in case that makes the problem more obvious...

One other thing I notice: paramiko's _read_packet method reads a 4-byte length-prefix, then reads the rest of the packet all at once; i.e. it calls: _read_all(4), _read_all(size). So that's why you see reads of 4, 32777, 4, 32777, 4, 32777, etc. But interestingly, not only are the final read sizes impossibly huge, they break this pattern:

"""
4
1714499681
1714466917
"""

Why didn't it read 4 bytes after reading the 1714499681 bytes?
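One plausible explanation for the broken pattern is framing desync: once the reader consumes the wrong number of bytes, every subsequent 4-byte "length prefix" is really arbitrary payload data, so the tidy 4, N, 4, N alternation disappears. A toy illustration of this (an editor's sketch, not paramiko code):

```python
import struct

# A well-formed stream: two length-prefixed frames.
stream = struct.pack('>I', 5) + b'hello' + struct.pack('>I', 3) + b'xyz'

# In sync: the first prefix reads as 5, as expected.
good = struct.unpack('>I', stream[:4])[0]
print(good)  # 5

# One byte out of sync: the "prefix" now straddles prefix and payload
# bytes, producing a garbage size unrelated to any real frame.
bad = struct.unpack('>I', stream[1:5])[0]
print(bad)  # 1384
```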

Revision history for this message
Jonathan Lange (jml) wrote :

$ time bzr info -v sftp://bazaar.launchpad.net/~launchpad/launchpad/trunk
Repository branch (format: 1.6)
Location:
  shared repository: sftp://bazaar.launchpad.net/%7Elaunchpad/launchpad/trunk/
  repository branch: sftp://bazaar.launchpad.net/%7Elaunchpad/launchpad/trunk/

Related branches:
  parent branch: /code/rocketfuel/launchpad/devel

Format:
       control: Meta directory format 1
        branch: Branch format 7
    repository: Packs 5 (adds stacking support, requires bzr 1.6)

Branch history:
      7034 revisions
bzr: ERROR: exceptions.MemoryError:

Traceback (most recent call last):
  File "/home/jml/Code/Bazaar/bzr.dev-2.5/bzrlib/commands.py", line 857, in run_bzr_catch_errors
    return run_bzr(argv)
  File "/home/jml/Code/Bazaar/bzr.dev-2.5/bzrlib/commands.py", line 797, in run_bzr
    ret = run(*run_argv)
  File "/home/jml/Code/Bazaar/bzr.dev-2.5/bzrlib/commands.py", line 499, in run_argv_aliases
    return self.run(**all_cmd_args)
  File "/home/jml/Code/Bazaar/bzr.dev-2.5/bzrlib/commands.py", line 818, in ignore_pipe
    result = func(*args, **kwargs)
  File "/home/jml/Code/Bazaar/bzr.dev-2.5/bzrlib/builtins.py", line 1076, in run
    verbose=noise_level, outfile=self.outf)
  File "/home/jml/Code/Bazaar/bzr.dev-2.5/bzrlib/info.py", line 346, in show_bzrdir_info
    outfile)
  File "/home/jml/Code/Bazaar/bzr.dev-2.5/bzrlib/info.py", line 379, in show_component_info
    stats = _show_branch_stats(branch, verbose==2, outfile)
  File "/home/jml/Code/Bazaar/bzr.dev-2.5/bzrlib/info.py", line 277, in _show_branch_stats
    stats = branch.repository.gather_stats(head, committers=verbose)
  File "/home/jml/Code/Bazaar/bzr.dev-2.5/bzrlib/decorators.py", line 138, in read_locked
    result = unbound(self, *args, **kwargs)
  File "/home/jml/Code/Bazaar/bzr.dev-2.5/bzrlib/repository.py", line 818, in gather_stats
    revisions = self.get_ancestry(revid)
  File "/home/jml/Code/Bazaar/bzr.dev-2.5/bzrlib/decorators.py", line 138, in read_locked
    result = unbound(self, *args, **kwargs)
  File "/home/jml/Code/Bazaar/bzr.dev-2.5/bzrlib/repository.py", line 1700, in get_ancestry
    found, ghosts = search.next_with_ghosts()
  File "/home/jml/Code/Bazaar/bzr.dev-2.5/bzrlib/graph.py", line 1242, in next_with_ghosts
    self._advance()
  File "/home/jml/Code/Bazaar/bzr.dev-2.5/bzrlib/graph.py", line 1252, in _advance
    found, ghosts, next, parents = self._do_query(self._next_query)
  File "/home/jml/Code/Bazaar/bzr.dev-2.5/bzrlib/graph.py", line 1275, in _do_query
    parent_map = self._parents_provider.get_parent_map(revisions)
  File "/home/jml/Code/Bazaar/bzr.dev-2.5/bzrlib/graph.py", line 132, in get_parent_map
    new_parents = self._real_provider.get_parent_map(needed)
  File "/home/jml/Code/Bazaar/bzr.dev-2.5/bzrlib/repository.py", line 1764, in get_parent_map
    self.revisions.get_parent_map(query_keys).iteritems():
  File "/home/jml/Code/Bazaar/bzr.dev-2.5/bzrlib/knit.py", line 1073, in get_parent_map
    return self._get_parent_map_with_sources(keys)[0]
  File "/home/jml/Code/Bazaar/bzr.dev-2.5/bzrlib/knit.py", line 1092, in _get_parent_map_with_sources
    new_result = source.get_parent_map(missing)
  File "/home/jml/Code/Bazaar/bzr...


Revision history for this message
Andrew Bennetts (spiv) wrote :

Thanks for that.

So your traceback is pretty different. If I had to guess, I'd say it's probably the same bug, but I'm not sure of that.

It's interesting that the other reports so far have all involved _read_info_file, but yours takes a completely different path (_buffer_all). It's convenient that you can reproduce this with a read-only operation, though.

I still have no idea what the problem is, though. If you have time, could you capture the SFTP conversation (as seen by the bzr client talking to its openssh subprocess)? Or if you can reproduce it locally, it might be interesting to capture from Conch's side.

Revision history for this message
John A Meinel (jameinel) wrote :

I'll note that "bzr info" *does* read the lock information. It does so to report whether the branch/repository is currently locked and who is locking it. I don't think it actually takes out a real lock during that time, though.

Revision history for this message
Jonathan Lange (jml) wrote :

Andrew, are you still working on this bug? Are you blocked on anything from me?

If the answer to the first question is "no", I'm happy to mark as invalid. If the answer to the second question is "yes", then we'll see what needs to be done.

Revision history for this message
domachine (dominik-burgdoerfer) wrote :

On Thursday 05 February 2009 07:34:06, Jonathan Lange wrote:
> Andrew, are you still working on this bug? Are you blocked on anything
> from me?
>
> If the answer to the first question is "no", I'm happy to mark as
> invalid. If the answer to the second question is "yes", then we'll see
> what needs to be done.

Oh, sorry,
but the bug resolved itself. I don't know how,
but it did :-D

Sorry for the delay.

Greetings Dominik

Revision history for this message
Andrew Bennetts (spiv) wrote :

Jono:

The answers are indeed "no" and "yes". This bug is still a mystery to me. The information I asked for in an earlier comment would still be interesting:

"If you have time, could you capture the SFTP conversation (as seen by the bzr client talking its the openssh subprocess)? Or if you can reproduce it locally it might be interesting to capture from Conch's side."

Basically, without a trace of the SFTP conversation it's going to be next to impossible to diagnose this, I think. With a trace it should be possible to replay the conversation against both paramiko and Conch and determine which side is screwing up.

Given that there are no other reports of this bug, if you can't reproduce it anymore then we should probably just close this bug and assume it got fixed by accident somehow.

Revision history for this message
Jonathan Lange (jml) wrote :

Here's a wireshark capture.

Revision history for this message
Andrew Bennetts (spiv) wrote :

It'd be a bit more useful to have an unencrypted trace of the SFTP conversation ;)

Changed in launchpad:
importance: Undecided → High
Changed in bzr:
status: Incomplete → Triaged
importance: Undecided → High
Changed in launchpad:
status: Incomplete → Triaged
Changed in bzr:
assignee: nobody → Andrew Bennetts (spiv)
Revision history for this message
Andrew Bennetts (spiv) wrote :

Robert: why is this no longer incomplete? Without more information, like traces of the unencrypted SFTP session, there's nothing more we can do. Also, the last report of this problem is almost 2 years old, so this problem might not even exist anymore.

Finally, I'm certainly not currently working on this, nor do I plan to, so it shouldn't be assigned to me.

Changed in bzr:
assignee: Andrew Bennetts (spiv) → nobody
Revision history for this message
Robert Collins (lifeless) wrote :

That's fine, Andrew. You *were already* assigned to it in the LP bugtask, so I'll unassign you there as well (I assigned you in bzr for *consistency* with the existing assignment, not to try to get you working on it :))

I agree that it may no longer exist, and we can close it if that's what we think is best. We're using 'Incomplete' a bit more aggressively than merely "we need more data"; it's more "we are happy to close this bug if no one answers in 2 months" (due to the bug garbage collector). This was a potentially serious issue, after all :).

Changed in launchpad:
assignee: Andrew Bennetts (spiv) → nobody
Revision history for this message
Martin Pool (mbp) wrote :

I think if this bug still existed, we probably would have heard about it in the last two years. Therefore, unless we get further information, we should just assume it was fixed, perhaps by an upgrade to Conch, Twisted, OpenSSH, or something similar.

Changed in bzr:
status: Triaged → Invalid
Changed in launchpad:
status: Triaged → Invalid