UnicodeEncodeError on commit with non-BMP Unicode characters

Bug #317644 reported by Wesley J. Landaker
This bug affects 2 people
Affects: Bazaar
Status: Fix Released
Importance: Medium
Assigned to: Unassigned

Bug Description

Update: I reported this against bzr-svn, but it's a bzr bug. I've included a smaller, simpler reproduction recipe in a later comment and updated the summary of this bug.

I get the following crash when trying to branch a valid, but generated test SVN repository.

The actual site of the crash is in bzrlib, but I've only been able (so far) to make this happen with generated SVN repositories, not with generated bzr repositories.

$ bzr branch repos bzr-branch
bzr: ERROR: exceptions.UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/bzrlib/commands.py", line 893, in run_bzr_catch_errors
    return run_bzr(argv)
  File "/usr/lib/python2.5/site-packages/bzrlib/commands.py", line 839, in run_bzr
    ret = run(*run_argv)
  File "/usr/lib/python2.5/site-packages/bzrlib/commands.py", line 539, in run_argv_aliases
    return self.run(**all_cmd_args)
  File "/usr/lib/python2.5/site-packages/bzrlib/builtins.py", line 1043, in run
    source_branch=br_from)
  File "/home/wjlanda/.bazaar/plugins/svn/remote.py", line 65, in sprout
    return super(SvnRemoteAccess, self).sprout(*args, **kwargs)
  File "/usr/lib/python2.5/site-packages/bzrlib/bzrdir.py", line 1103, in sprout
    result_repo.fetch(source_repository, revision_id=revision_id)
  File "/usr/lib/python2.5/site-packages/bzrlib/repository.py", line 1118, in fetch
    find_ghosts=find_ghosts)
  File "/home/wjlanda/.bazaar/plugins/svn/fetch.py", line 973, in fetch
    self._fetch_revisions(needed, pb, use_replay=use_replay)
  File "/home/wjlanda/.bazaar/plugins/svn/fetch.py", line 910, in _fetch_revisions
    parent_revmeta)
  File "/home/wjlanda/.bazaar/plugins/svn/fetch.py", line 866, in _fetch_revision_switch
    report_inventory_contents(reporter, parent_revnum, start_empty)
  File "/home/wjlanda/.bazaar/plugins/svn/fetch.py", line 703, in report_inventory_contents
    reporter.finish()
  File "/home/wjlanda/.bazaar/plugins/svn/fetch.py", line 196, in close
    self._close()
  File "/home/wjlanda/.bazaar/plugins/svn/fetch.py", line 331, in _close
    self.editor._finish_commit()
  File "/home/wjlanda/.bazaar/plugins/svn/fetch.py", line 515, in _finish_commit
    self.inventory, rev.parent_ids)
  File "/usr/lib/python2.5/site-packages/bzrlib/repository.py", line 670, in add_inventory
    inv_lines = self._serialise_inventory_to_lines(inv)
  File "/usr/lib/python2.5/site-packages/bzrlib/repository.py", line 1714, in _serialise_inventory_to_lines
    return self._serializer.write_inventory_to_lines(inv)
  File "/usr/lib/python2.5/site-packages/bzrlib/xml8.py", line 200, in write_inventory_to_lines
    return self.write_inventory(inv, None)
  File "/usr/lib/python2.5/site-packages/bzrlib/xml8.py", line 258, in write_inventory
    _encode_and_escape(ie.name),
  File "/usr/lib/python2.5/site-packages/bzrlib/xml8.py", line 104, in _encode_and_escape
    unicode_or_utf8_str)) + '"'
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

bzr 1.12dev on python 2.5.2 (linux2)
arguments: ['/usr/bin/bzr', 'branch', 'repos', 'bzr-branch']
encoding: 'UTF-8', fsenc: 'UTF-8', lang: 'en_US.UTF-8'
plugins:
  bzrtools /usr/lib/python2.5/site-packages/bzrlib/plugins/bzrtools [1.11]
  cvsps_import /usr/lib/python2.5/site-packages/bzrlib/plugins/cvsps_import [unknown]
  gtk /usr/lib/python2.5/site-packages/bzrlib/plugins/gtk [0.96.0.dev.1]
  launchpad /usr/lib/python2.5/site-packages/bzrlib/plugins/launchpad [unknown]
  loom /usr/lib/python2.5/site-packages/bzrlib/plugins/loom [1.4dev]
  netrc_credential_store /usr/lib/python2.5/site-packages/bzrlib/plugins/netrc_credential_store [unknown]
  rebase /usr/lib/python2.5/site-packages/bzrlib/plugins/rebase [0.3]
  search /usr/lib/python2.5/site-packages/bzrlib/plugins/search [1.6.0.dev.3]
  stats /usr/lib/python2.5/site-packages/bzrlib/plugins/stats [unknown]
  svn /home/wjlanda/.bazaar/plugins/svn [0.5rc1]
  upload /usr/lib/python2.5/site-packages/bzrlib/plugins/upload [0.1]
*** Bazaar has encountered an internal error.
    Please report a bug at https://bugs.launchpad.net/bzr/+filebug
    including this traceback, and a description of what you
    were doing when the error occurred.

I've attached the repository. It's generated by a randomized script, so it doesn't have any semantic meaning, but it is valid.

Revision history for this message
Jelmer Vernooij (jelmer) wrote :

bzr itself doesn't accept these filenames either, so it's not related to bzr-svn.

E.g. try adding a file name u'\U0005d062\U000df631' to bzr.

Revision history for this message
Wesley J. Landaker (wjl) wrote :

Okay, I see you are right. (Bzr actually will let you "bzr add" them, but dies if you try to "bzr commit" them. I just didn't go far enough.)

It looks like for some reason, it's trying to decode UTF-8 with the ascii codec, i.e.:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

Revision history for this message
Wesley J. Landaker (wjl) wrote :

Or rather, it is trying to encode a Unicode string to ASCII, instead of UTF-8.
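
For illustration only, here is a minimal Python 3 sketch of that distinction (the traceback in this report comes from Python 2.5, so this is not the bzrlib code path). It uses the example file name mentioned earlier in this thread; the variable names are made up for the example.

```python
# Python 3 sketch; not the code bzrlib runs.
name = '\U0005d062\U000df631'  # example file name from earlier in this thread

# Encoding to ASCII cannot represent these characters and raises.
try:
    name.encode('ascii')
except UnicodeEncodeError as exc:
    print(exc.encoding, exc.start, exc.reason)

# Encoding to UTF-8 succeeds; each of the two characters takes four bytes.
utf8_name = name.encode('utf-8')
print(len(utf8_name))  # 8
```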

Revision history for this message
Wesley J. Landaker (wjl) wrote :

A simpler reproduction recipe involving only bzr and the Unicode character U+20000 CJK UNIFIED IDEOGRAPH-20000.

# note that the locale is UTF-8
$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US:en_GB:en
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
$ touch 𠀀
# The file is correctly in UTF-8:
$ ls | xxd
0000000: f0a0 8080 0a
$ bzr init
bzr add
Standalone tree (format: pack-0.92)
Location:
  branch root: .
# Adding works fine
$ bzr add
added "𠀀"
# But committing fails
$ bzr commit -m "Adding file"
Committing to: /tmp/bzr-bug-317644/
added 𠀀
aborting commit write group: UnicodeEncodeError('ascii', u'\U00020000', 0, 1, 'ordinal not in range(128)')
bzr: ERROR: exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\U00020000' in position 0: ordinal not in range(128)

Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/bzrlib/commands.py", line 893, in run_bzr_catch_errors
    return run_bzr(argv)
  File "/usr/lib/python2.5/site-packages/bzrlib/commands.py", line 839, in run_bzr
    ret = run(*run_argv)
  File "/usr/lib/python2.5/site-packages/bzrlib/commands.py", line 539, in run_argv_aliases
    return self.run(**all_cmd_args)
  File "/usr/lib/python2.5/site-packages/bzrlib/builtins.py", line 2507, in run
    exclude=safe_relpath_files(tree, exclude))
  File "/usr/lib/python2.5/site-packages/bzrlib/decorators.py", line 192, in write_locked
    result = unbound(self, *args, **kwargs)
  File "/usr/lib/python2.5/site-packages/bzrlib/workingtree_4.py", line 224, in commit
    result = WorkingTree3.commit(self, message, revprops, *args, **kwargs)
  File "/usr/lib/python2.5/site-packages/bzrlib/decorators.py", line 192, in write_locked
    result = unbound(self, *args, **kwargs)
  File "/usr/lib/python2.5/site-packages/bzrlib/mutabletree.py", line 208, in commit
    *args, **kwargs)
  File "/usr/lib/python2.5/site-packages/bzrlib/commit.py", line 376, in commit
    self.builder.finish_inventory()
  File "/usr/lib/python2.5/site-packages/bzrlib/repository.py", line 192, in finish_inventory
    self.parents
  File "/usr/lib/python2.5/site-packages/bzrlib/repository.py", line 670, in add_inventory
    inv_lines = self._serialise_inventory_to_lines(inv)
  File "/usr/lib/python2.5/site-packages/bzrlib/repository.py", line 1714, in _serialise_inventory_to_lines
    return self._serializer.write_inventory_to_lines(inv)
  File "/usr/lib/python2.5/site-packages/bzrlib/xml8.py", line 200, in write_inventory_to_lines
    return self.write_inventory(inv, None)
  File "/usr/lib/python2.5/site-packages/bzrlib/xml8.py", line 246, in write_inventory
    _encode_and_escape(ie.name), parent_str, parent_id,
  File "/usr/lib/python2.5/site-packages/bzrlib/xml8.py", line 104, in _encode_and_escape
    unicode_or_utf8_str)) + '"'
UnicodeEncodeError: 'ascii' codec can't encode character u'\U00020000' in position 0: ordinal not in range(128)

bzr 1.12dev on python 2.5.2 (linux2)
argum...

description: updated
Revision history for this message
Jelmer Vernooij (jelmer) wrote : Re: [Bug 317644] Re: UnicodeEncodeError with generated test SVN repository

  status triaged
  importance medium

Changed in bzr:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Wesley J. Landaker (wjl) wrote :

The problem seems to be here in bzrlib/xml8.py:
_unicode_re = re.compile(u'[&<>\'\"\u0080-\uffff]')

That doesn't cover all of Unicode, it just covers the BMP. This should be the following instead:
_unicode_re = re.compile(u'[&<>\'\"\u0080-\U0010ffff]')

Making that change in my tree fixes the problem for me.
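
A rough Python 3 sketch of why the character class matters. The two patterns mirror the ones quoted above; escape_to_ascii is a hypothetical helper written for this example (it is not the bzrlib function), assuming the serializer replaces each match with a numeric character reference and then converts the result to ASCII.

```python
import re

# The two character classes quoted above, written as Python 3 literals.
BMP_ONLY = re.compile('[&<>\'"\u0080-\uffff]')
FULL_RANGE = re.compile('[&<>\'"\u0080-\U0010ffff]')


def escape_to_ascii(pattern, text):
    """Hypothetical helper: replace each match with a numeric character
    reference, then encode the result as ASCII."""
    escaped = pattern.sub(lambda m: '&#%d;' % ord(m.group()), text)
    return escaped.encode('ascii')


name = '\U00020000'  # U+20000, the file name from the reproduction recipe

# The BMP-only class stops at U+FFFF, so U+20000 is never escaped.
print(BMP_ONLY.search(name))  # None

# The full-range class escapes it, and the ASCII step then succeeds.
print(escape_to_ascii(FULL_RANGE, name))  # b'&#131072;'

# With the BMP-only class the raw character reaches the ASCII step and fails.
try:
    escape_to_ascii(BMP_ONLY, name)
except UnicodeEncodeError as exc:
    print('BMP-only pattern fails:', exc.reason)
```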

Revision history for this message
Wesley J. Landaker (wjl) wrote :

Here is the patch I mentioned as a bzr bundle.

Revision history for this message
John A Meinel (jameinel) wrote :

As mentioned in the merge request, it turns out to not be that simple.

It breaks down as follows:

1) Python Unicode strings are UCS-4 on Linux and UCS-2 on Windows.
2) On Linux, _unicode_re = re.compile(u'[&<>\'\"\u0080-\U0010ffff]') seems to match u'\U0010abcd' just fine.
   On Windows, it only matches the first code unit of the two-code-unit (surrogate pair) form, and fails to encode.
3) As such, we *could* change the regex based on platform, but that causes other problems, because different platforms would then generate different inventory texts.
4) I believe the next repo format (--development6-rich-root) will not suffer from this, because it doesn't use XML, and will just represent the paths as UTF-8.

Can someone test with --development6-rich-root and see whether commit, etc. work fine? If so, I'd rather not mess around with the XML escaping: Unicode is just a big pile of incompatibilities (from different implementers), so I'd prefer to stick with what has worked for us for the last couple of years.
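
To make points 1 and 2 concrete, here is a small Python 3 sketch that simulates what a UCS-2 build would see for U+20000. Python 3 strings are always sequences of code points, so the two code units are obtained by encoding to UTF-16. The narrow_class_matches function is an assumption about how a narrow build would read the proposed character class (with u'\U0010ffff' stored as the pair u'\udbff\udfff', the class becomes the range U+0080..U+DBFF plus the single unit U+DFFF); it is not taken from bzrlib or verified on an actual narrow build.

```python
name = '\U00020000'

# Split the UTF-16 encoding into 16-bit code units (the "2-character form").
utf16 = name.encode('utf-16-be')
units = [int.from_bytes(utf16[i:i + 2], 'big') for i in range(0, len(utf16), 2)]
print([hex(u) for u in units])  # ['0xd840', '0xdc00']


def narrow_class_matches(unit):
    # Assumption: on a narrow build the class [\u0080-\U0010ffff] is read as
    # the range \u0080-\udbff followed by the literal \udfff.
    return 0x0080 <= unit <= 0xDBFF or unit == 0xDFFF


# Only the first (high surrogate) unit falls inside the simulated class.
print([(hex(u), narrow_class_matches(u)) for u in units])
```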

Martin Pool (mbp)
Changed in bzr:
status: Triaged → Confirmed
Revision history for this message
Jelmer Vernooij (jelmer) wrote :

This works fine with CHK repositories (bzr 2.0 and later), so I'm going to mark this as Fix Released.

Changed in bzr:
status: Confirmed → Fix Released