EADDR inside pyrex readdir on x86 Solaris

Bug #297831 reported by Martin Pool
Affects: Bazaar
Status: Fix Released
Importance: Medium
Assigned to: Unassigned

Bug Description

Bonnie Lucas in <https://lists.ubuntu.com/archives/bazaar/2008q4/049442.html> reports that some bzr commands fail with EADDR ([Errno 14] Bad address, which errno.h names EFAULT) in _readdir_pyx; I think this is because the Pyrex extension is passing a bad address to the kernel:

Here is the result of the command:

[13] > python -c "import traceback"
[14] >

I am running SunOS 5.10 Generic_127128-11 i86pc i386 i86pc

Python version 2.5.1

Here is the output from the original failure without the debugging
information

> mkdir testfoobear
> cd testfoobear
> echo "Testing" > bearfile1
> bzr init
Standalone tree (format: pack-0.92)
Location:
 branch root: .
> bzr add
added bearfile1
> bzr commit -m "first try"
Committing to: /home/lucas/testfoobear/
added bearfile1
Committed revision 1.
> ls -la
total 16
drwx------ 3 lucas cserv 512 Nov 12 10:59 .
drwxr-xr-x 49 lucas cserv 4608 Nov 12 10:59 ..
-rw------- 1 lucas cserv 8 Nov 12 10:59 bearfile1
drwx------ 6 lucas cserv 512 Nov 12 10:59 .bzr
> cd ..
> mkdir testfoobear2
> cd testfoobear2
> bzr branch ../testfoobear
bzr: ERROR: [Errno 14] Bad address

Another thing I discovered today is that sometimes the file appears to be
deleted from the original directory. That did not happen in this
case, though.

0.076 Plugin name __init__ already loaded
   accelerator_tree, hardlink)
 File "/usr/local/lib/python2.5/site-packages/bzrlib/transform.py", line 2042, in _create_files
   in iter if not (c or e[0] != e[1]))
 File "/usr/local/lib/python2.5/site-packages/bzrlib/transform.py", line 2041, in <genexpr>
   unchanged = dict((f, p[1]) for (f, p, c, v, d, n, k, e)
 File "_dirstate_helpers_c.pyx", line 1347, in _dirstate_helpers_c.ProcessEntryC.__next__
 File "_dirstate_helpers_c.pyx", line 1511, in _dirstate_helpers_c.ProcessEntryC._iter_next
 File "_dirstate_helpers_c.pyx", line 1488, in _dirstate_helpers_c._iter_next
 File "/usr/local/lib/python2.5/site-packages/bzrlib/osutils.py", line 1313, in _walkdirs_utf8
   dirblock = sorted(read_dir(relroot, top))
 File "_readdir_pyx.pyx", line 229, in _readdir_pyx.UTF8DirReader.read_dir
 File "_readdir_pyx.pyx", line 344, in _readdir_pyx._read_dir
OSError: [Errno 14] Bad address

0.573 return code 3

Tags: solaris pyrex

Revision history for this message
Martin Pool (mbp) wrote : Re: problems running bzr

On Fri, Nov 14, 2008 at 1:53 AM, Bonnie Lucas <email address hidden> wrote:
> I could not find a traceback in ~/.bzr.log. Here are the last
> lines of bzr.log
>
> 0.076 Plugin name __init__ already loaded
> accelerator_tree, hardlink)
> File "/usr/local/lib/python2.5/site-packages/bzrlib/transform.py", line 2042, in _create_files
> in iter if not (c or e[0] != e[1]))
> File "/usr/local/lib/python2.5/site-packages/bzrlib/transform.py", line 2041, in <genexpr>
> unchanged = dict((f, p[1]) for (f, p, c, v, d, n, k, e)
> File "_dirstate_helpers_c.pyx", line 1347, in _dirstate_helpers_c.ProcessEntryC.__next__
> File "_dirstate_helpers_c.pyx", line 1511, in _dirstate_helpers_c.ProcessEntryC._iter_next
> File "_dirstate_helpers_c.pyx", line 1488, in _dirstate_helpers_c._iter_next
> File "/usr/local/lib/python2.5/site-packages/bzrlib/osutils.py", line 1313, in _walkdirs_utf8
> dirblock = sorted(read_dir(relroot, top))
> File "_readdir_pyx.pyx", line 229, in _readdir_pyx.UTF8DirReader.read_dir
> File "_readdir_pyx.pyx", line 344, in _readdir_pyx._read_dir
> OSError: [Errno 14] Bad address
>
> 0.573 return code 3

That's the traceback, thanks. I filed
<https://bugs.launchpad.net/bzr/+bug/297831> regarding this. It looks
like there is a bug, probably a portability bug, in _readdir_pyx, in
how it's dealing with the OS.

You may be able to work around this for the moment by moving the
_readdir_pyx.so file in the bzrlib directory, e.g. to
_readdir_pyx.so.bad. It would be helpful to know if that fixes it.
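
For anyone who prefers to script that workaround, here is a minimal sketch in Python. The helper name `disable_extension` and its `bzrlib_dir` parameter are hypothetical and not part of bzr; only the file name `_readdir_pyx.so` and the `.bad` suffix come from the suggestion above. It assumes you have write access to the bzrlib directory.

```python
import os


def disable_extension(bzrlib_dir, name="_readdir_pyx.so"):
    """Rename a compiled extension so Python can no longer import it.

    Returns the new path, or None if the file was not there.
    """
    src = os.path.join(bzrlib_dir, name)
    if not os.path.exists(src):
        return None
    dst = src + ".bad"
    os.rename(src, dst)  # keep the file around so it can be restored later
    return dst
```

With the paths from the traceback above, the call would look like `disable_extension("/usr/local/lib/python2.5/site-packages/bzrlib")`; renaming the file back undoes the change.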

--
Martin <http://launchpad.net/~mbp/>

Revision history for this message
Martin Pool (mbp) wrote :

One other thing that would help us debug this would be if you'd run
bzr under strace (or truss, or the equivalent on Solaris) and attach
the output file to that bug. Then we can see more about just which
system call is failing.

--
Martin <http://launchpad.net/~mbp/>

Revision history for this message
Martin Pool (mbp) wrote :

And one more thing: we did fix a bug in this routine in bzr 1.8 (bug
279831), not producing quite the same symptoms, but it might be
related. So it'd be good if you could upgrade to either 1.8 or 1.9,
if you're not already there.

Thanks,
Martin

Martin Pool (mbp)
Changed in bzr:
importance: Undecided → Medium
status: New → Incomplete
Revision history for this message
Harry Hirsch (bzr-unbunt) wrote :

Got the same bug under SunOS 5.10 Generic_127112-11 i86pc i386 i86pc with Python 2.6 and version 1.9 of bzr.

Moving _readdir_pyx.so out of the way as suggested fixes the problem for me.

A chdir() syscall with an empty string as argument (directory name) seems to cause the error.

Here is the relevant bzr.log snippet:

-----------
Wed 2008-11-19 14:16:27 +0100
0.060 bzr arguments: [u'st']
0.060 looking for plugins in /home/xxxx/.bazaar/plugins
0.060 looking for plugins in /home/xxxx/opt/bzr1.9/lib/python/bzrlib/plugins
0.139 looking for plugins in /home/xxxx/opt/python2.6/lib/python2.6/site-packages/bzrlib/plugins
0.145 encoding stdout as sys.stdin encoding 'UTF-8'
0.217 opening working tree '/var.local/home/xxxx/repo'
0.224 check paths: None
0.234 Traceback (most recent call last):
  File "/home/xxxx/opt/bzr1.9/lib/python/bzrlib/commands.py", line 893, in run_bzr_catch_errors
    return run_bzr(argv)
  File "/home/xxxx/opt/bzr1.9/lib/python/bzrlib/commands.py", line 839, in run_bzr
    ret = run(*run_argv)
  File "/home/xxxx/opt/bzr1.9/lib/python/bzrlib/commands.py", line 539, in run_argv_aliases
    return self.run(**all_cmd_args)
  File "/home/xxxx/opt/bzr1.9/lib/python/bzrlib/commands.py", line 853, in ignore_pipe
    result = func(*args, **kwargs)
  File "/home/xxxx/opt/bzr1.9/lib/python/bzrlib/builtins.py", line 217, in run
    show_pending=(not no_pending))
  File "/home/xxxx/opt/bzr1.9/lib/python/bzrlib/status.py", line 114, in show_tree_status
    want_unversioned=want_unversioned)
  File "/home/xxxx/opt/bzr1.9/lib/python/bzrlib/tree.py", line 95, in changes_from
    want_unversioned=want_unversioned,
  File "/home/xxxx/opt/bzr1.9/lib/python/bzrlib/decorators.py", line 138, in read_locked
    result = unbound(self, *args, **kwargs)
  File "/home/xxxx/opt/bzr1.9/lib/python/bzrlib/tree.py", line 814, in compare
    want_unversioned=want_unversioned)
  File "/home/xxxx/opt/bzr1.9/lib/python/bzrlib/delta.py", line 217, in _compare_trees
    want_unversioned=want_unversioned):
  File "_dirstate_helpers_c.pyx", line 1347, in _dirstate_helpers_c.ProcessEntryC.__next__
  File "_dirstate_helpers_c.pyx", line 1511, in _dirstate_helpers_c.ProcessEntryC._iter_next
  File "_dirstate_helpers_c.pyx", line 1488, in _dirstate_helpers_c._iter_next
  File "/home/xxxx/opt/bzr1.9/lib/python/bzrlib/osutils.py", line 1313, in _walkdirs_utf8
    dirblock = sorted(read_dir(relroot, top))
  File "_readdir_pyx.pyx", line 229, in _readdir_pyx.UTF8DirReader.read_dir
  File "_readdir_pyx.pyx", line 339, in _readdir_pyx._read_dir
OSError: [Errno 14] Bad address

0.235 return code 3
-----------

Attached you can find the corresponding truss log.

Regards,
Harry

Revision history for this message
John A Meinel (jameinel) wrote :

Attached is a patch which should fix this.

Is it possible for you to try it (or the associated branch) and let us know?

Revision history for this message
Harry Hirsch (bzr-unbunt) wrote :

Unfortunately the chdir_empty.patch doesn't solve the problem. The attached patch solves it instead.

The real problem was the getcwd() call, which relies on Linux-specific (or rather non-POSIX) semantics.

On Solaris, the size argument of getcwd() seems to always specify the size of the buffer in which the path name is stored. Unlike the Linux implementation, it does not allocate as much memory as necessary when size is zero; instead, if the buf argument is NULL, it allocates a zero-size buffer.

So on Solaris this call to getcwd(NULL, 0) always failed, and since the error wasn't caught, the chdir() call in the finally block was made with a zero-length buffer as argument.

The patch replaces the offending getcwd() call with one that is passed a manually allocated, large enough buffer (passing NULL as the buf argument seems to be an extension to POSIX).

Also it adds error handling to the getcwd() and the corresponding malloc().
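
To illustrate the portable calling convention described above, here is a rough sketch in Python using ctypes. It is not the attached patch (which is against the Pyrex source); the function name `getcwd_with_buffer` and the starting size of 256 are my own assumptions. The caller allocates the buffer, checks the return value, and grows the buffer on ERANGE, so nothing depends on what getcwd(NULL, 0) does on a given platform.

```python
import ctypes
import errno
import os

# CDLL(None) exposes the C library of the running process on POSIX systems.
libc = ctypes.CDLL(None, use_errno=True)
libc.getcwd.argtypes = [ctypes.c_char_p, ctypes.c_size_t]
libc.getcwd.restype = ctypes.c_void_p  # a NULL return comes back as None


def getcwd_with_buffer(size=256):
    """Call getcwd(3) with a caller-allocated buffer, growing it on ERANGE."""
    while True:
        buf = ctypes.create_string_buffer(size)
        if libc.getcwd(buf, size) is not None:
            return buf.value  # bytes up to the terminating NUL
        err = ctypes.get_errno()
        if err != errno.ERANGE:
            # A real failure: report it instead of carrying on with a bad buffer.
            raise OSError(err, os.strerror(err))
        size *= 2  # buffer was too small, retry with a bigger one
```

The important part is that a failed getcwd() raises immediately, so a later chdir() never sees an empty or invalid path.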

Regards,
Harry

Revision history for this message
Martin Pool (mbp) wrote :

Thanks for identifying the problem and for the patch, Harry.

As it stands that patch can leak the allocated memory if one of the operations fails.

It seems to me it would actually be easier to open a file descriptor on '.' before leaving it and then fchdir back to it at the end. That would possibly be faster than copying the names around.

It would be even better to not bother with going back to the original location until we're done scanning all the directories but that might be a larger change.
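
A small sketch of the open/fchdir idea, written in plain Python rather than Pyrex. The names `saved_cwd` and `list_dir_via_chdir` are made up for illustration and are not bzrlib functions. The descriptor on '.' is closed even if fchdir fails, so nothing leaks on the error path.

```python
import os
from contextlib import contextmanager


@contextmanager
def saved_cwd():
    """Remember the current directory by fd and return to it afterwards."""
    fd = os.open(".", os.O_RDONLY)
    try:
        yield
    finally:
        try:
            os.fchdir(fd)  # go back without ever needing the path name
        finally:
            os.close(fd)  # always release the descriptor


def list_dir_via_chdir(path):
    """chdir into path, list it, then restore the original directory."""
    with saved_cwd():
        os.chdir(path)
        return sorted(os.listdir("."))
```

No path buffer is allocated or copied here, which is why this avoids the getcwd() portability question entirely.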

Revision history for this message
Martin Pool (mbp) wrote : [merge][#297831] use open/fchdir to save and restore cwd

Harry Hirsch points out in
<https://bugs.edge.launchpad.net/bzr/+bug/297831> that we were relying
on a GNU extension (I think) to getcwd, which does not work on SunOS.
This avoids it by using fchdir, and also includes John's previous fix.

--
Martin <http://launchpad.net/~mbp/>

Revision history for this message
Martin Pool (mbp) wrote :
Changed in bzr:
status: Incomplete → In Progress
Revision history for this message
Martin Pool (mbp) wrote :
Revision history for this message
Harry Hirsch (bzr-unbunt) wrote :

Confirmed. Works for me. Thanks.

Regards,
Harry

Revision history for this message
John A Meinel (jameinel) wrote :

Martin Pool wrote:

+ while entry != NULL:
+ # Unlike most libc functions, readdir needs errno set to 0
+ # beforehand so that eof can be distinguished from errors. See
+ # <https://bugs.launchpad.net/bzr/+bug/279381>
+ while True:
+ errno = 0;
+ entry = readdir(the_dir)
+

^- We don't need the semicolon at the end of "errno = 0" (I realize you
probably didn't put it there.)

...

+ if -1 != orig_dir_fd:
+ if -1 == fchdir(orig_dir_fd):
+ raise OSError(errno, strerror(errno))
+ if -1 == close(orig_dir_fd):
+ raise OSError(errno, strerror(errno))
+

^- Do we want to raise before we close the orig_dir_fd?

BB:approve

(I have confirmed that it works on cygwin at least.)

I'm a tiny bit concerned about the implications on cygwin, in that we
are holding open a file handle, but I'm not very concerned.

John
=:->

Revision history for this message
Martin Pool (mbp) wrote :

On Fri, Nov 21, 2008 at 2:23 AM, John Arbash Meinel
<email address hidden> wrote:
> + while entry != NULL:
> + # Unlike most libc functions, readdir needs errno set to 0
> + # beforehand so that eof can be distinguished from errors. See
> + # <https://bugs.launchpad.net/bzr/+bug/279381>
> + while True:
> + errno = 0;
> + entry = readdir(the_dir)
> +
>
> ^- We don't need the semicolon at the end of "errno = 0" (I realize you
> probably didn't put it there.)

Fixed.
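
For reference, the errno-reset pattern from the quoted hunk can be sketched in Python with ctypes. This is an illustration only, not the code that was merged; `count_dir_entries` is a made-up name, and it only counts entries (including "." and "..", where the filesystem returns them) because decoding struct dirent is platform-specific.

```python
import ctypes
import os

libc = ctypes.CDLL(None, use_errno=True)
libc.opendir.argtypes = [ctypes.c_char_p]
libc.opendir.restype = ctypes.c_void_p
libc.readdir.argtypes = [ctypes.c_void_p]
libc.readdir.restype = ctypes.c_void_p
libc.closedir.argtypes = [ctypes.c_void_p]
libc.closedir.restype = ctypes.c_int


def count_dir_entries(path):
    """Count directory entries with readdir(3), telling EOF from errors."""
    the_dir = libc.opendir(os.fsencode(path))
    if the_dir is None:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    count = 0
    try:
        while True:
            ctypes.set_errno(0)  # readdir only sets errno when it fails
            entry = libc.readdir(the_dir)
            if entry is None:
                err = ctypes.get_errno()
                if err != 0:
                    raise OSError(err, os.strerror(err))
                break  # NULL with errno still 0 means end of directory
            count += 1
    finally:
        libc.closedir(the_dir)
    return count
```

Without the reset, a stale errno left over from an earlier call would make a normal end of directory look like a failure.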

> + if -1 != orig_dir_fd:
> + if -1 == fchdir(orig_dir_fd):
> + raise OSError(errno, strerror(errno))
> + if -1 == close(orig_dir_fd):
> + raise OSError(errno, strerror(errno))
> +
>
> ^- Do we want to raise before we close the orig_dir_fd?

Yes.

> BB:approve
>
> (I have confirmed that it works on cygwin at least.)
>
> I'm a tiny bit concerned about the implications on cygwin, in that we
> are holding open a file handle, but I'm not very concerned.

Because of a limited number of file handles, or ...?

I would not be surprised if fchdir can't be done locally and needs to
be emulated in some way in cygwin. Can we use the _walkdirs_win32
from inside cygwin?

Anyhow, I've sent it to pqm.

--
Martin <http://launchpad.net/~mbp/>

Revision history for this message
Martin Pool (mbp) wrote :

Should be fixed in 1.10rc1.

Changed in bzr:
milestone: none → 1.10rc1
status: In Progress → Fix Released
Revision history for this message
John A Meinel (jameinel) wrote :

Martin Pool wrote:

> Because of a limited number of file handles, or ...?

because holding a file handle implicitly locks all of the containing
directories.

So if you open "C:\foo\bar\baz.txt" all of "foo", "bar", and "baz.txt"
are locked and cannot be renamed, etc.

And if we get a failure and fail to close the file handle, then we've
leaked the file handle and path locks.

It will all be cleared up when the process exits.

>
> I would not be surprised if fchdir can't be done locally and needs to
> be emulated in some way in cygwin. Can we use the _walkdirs_win32
> from inside cygwin?
>
> Anyhow, I've sent it to pqm.
>

Probably some way we could get there, but it doesn't play nicely with
how cygwin manages the Executable bit, etc.

Certainly it would be nice, doing "bzr status" under cygwin takes 2+s on
a bzr.dev tree, while it takes only 0.5s under Win32 (and 0.8s if I
disable the _walkdirs_win32 extension.)

Obviously there is a lot of overhead in using cygwin, which is why *I*
stopped a long time ago.

John
=:->

Revision history for this message
Bonnie (lucas-usna) wrote : RE: problems running bzr

I just wanted to update you. I have tried completely reinstalling everything, from
the OS to the versions (I have tried a number of versions) of Python,
the dependencies for Bazaar, and Bazaar itself. I still have the same error.

[errno 14] Bad address

We were planning on using bazaar for the spring semester. Right now those
plans are on hold until we can find out how to solve this problem.

Thanks - Bonnie

-----Original Message-----
From: <email address hidden> [mailto:<email address hidden>] On Behalf Of Martin
Pool
Sent: Thursday, November 13, 2008 5:51 PM
To: Bonnie Lucas
Cc: <email address hidden>; Christopher W. Brown;
<email address hidden>
Subject: Re: problems running bzr

And one more thing: we did fix a bug in this routine in bzr 1.8 (bug
279831), not producing quite the same symptoms, but it might be
related. So it'd be good if you could upgrade to either 1.8 or 1.9,
if you're not already there.

Thanks,
Martin

Revision history for this message
Vincent Ladeuil (vila) wrote : Re: [Bug 297831] RE: problems running bzr

>>>>> "Bonnie" writes:

    Bonnie> I just wanted to update you. I have tried completely reinstalling from
    Bonnie> the OS to the versions ( I have tried a number of versions) of python,
    Bonnie> the dependencies for bazaar and bazaar itself. I still have the same error.

    Bonnie> [errno 14] Bad address

Can you be more specific about the versions you're using:

  uname -a
  bzr version

Will tell us a lot.
