Handling of names in UTF-8 (Unicode)

Bug #238365 reported by Daniel Clemente
Affects: Bazaar Fast Import
Status: Fix Released
Importance: Medium
Assigned to: Daniel Clemente
Milestone: 0.10.0

Bug Description

bzr fast-import fails while importing a repository where there are symbolic links pointing to non-ASCII names.

* How to reproduce it:

First make sure that your locale is UTF-8. The following command should print 2: echo -n é | wc -c

Then do:

mkdir tres
cd tres
git init
touch més
ln -s més prova
git add més prova
git commit -a -m "link to a file with a name in utf-8"
git fast-export --all >expo
mkdir enbzr
cd enbzr
bzr init
bzr fast-import ../expo

* The result:

Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/bzrlib/commands.py", line 846, in run_bzr_catch_errors
    return run_bzr(argv)
  File "/usr/lib/python2.5/site-packages/bzrlib/commands.py", line 797, in run_bzr
    ret = run(*run_argv)
  File "/usr/lib/python2.5/site-packages/bzrlib/commands.py", line 499, in run_argv_aliases
    return self.run(**all_cmd_args)
  File "/home/dc/.bazaar/plugins/fastimport_dev/__init__.py", line 166, in run
  File "/home/dc/.bazaar/plugins/fastimport_dev/__init__.py", line 50, in _run
  File "/home/dc/.bazaar/plugins/fastimport/processor.py", line 83, in process
    self._process(command_iter)
  File "/home/dc/.bazaar/plugins/fastimport/processors/generic_processor.py", line 251, in _process
    processor.ImportProcessor._process(self, command_iter)
  File "/home/dc/.bazaar/plugins/fastimport/processor.py", line 105, in _process
    handler(self, cmd)
  File "/home/dc/.bazaar/plugins/fastimport/processors/generic_processor.py", line 413, in commit_handler
    handler.process()
  File "/home/dc/.bazaar/plugins/fastimport/processor.py", line 170, in process
    handler(self, fc)
  File "/home/dc/.bazaar/plugins/fastimport/processors/generic_processor.py", line 694, in modify_handler
    filecmd.is_executable, data)
  File "/home/dc/.bazaar/plugins/fastimport/processors/generic_processor.py", line 835, in _modify_inventory
    ie.symlink_target = data.encode('utf8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

bzr 1.5 on python 2.5.2 (linux2)
arguments: ['/usr/bin/bzr', 'fast-import', '../expo']
encoding: 'UTF-8', fsenc: 'UTF-8', lang: 'es_ES.UTF-8'
plugins:
  bzrtools /usr/lib/python2.5/site-packages/bzrlib/plugins/bzrtools [1.5.0]
  dbus /usr/lib/python2.5/site-packages/bzrlib/plugins/dbus [unknown]
  fastimport /home/dc/.bazaar/plugins/fastimport [unknown]
  gtk /usr/lib/python2.5/site-packages/bzrlib/plugins/gtk [0.94.0]
  launchpad /usr/lib/python2.5/site-packages/bzrlib/plugins/launchpad [unknown]
... Bazaar has encountered an internal error.
    Please report a bug at https://bugs.launchpad.net/bzr/+filebug
    including this traceback, and a description of what you
    were doing when the error occurred.

Tested with the stable version of fastimport and also with today's version from fastimport.dev.

Daniel Clemente (n142857) wrote :

This line:
  ie.symlink_target = data.encode('utf8')
could be changed to:
  ie.symlink_target = data
to prevent a failed conversion (from utf8 to utf8?) and to store the symlink target exactly as it is. However, Bazaar doesn't yet support symlinks to Unicode (in fact, any non-ASCII) filenames, so this bug depends on bug 272444.
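
A note on why data.encode('utf8') raises a decode error rather than an encode error: on Python 2, data is a byte string here, and calling .encode() on a byte string first decodes it implicitly with the ASCII codec. The sketch below spells out that implicit step on Python 3 (an assumption made so that it runs on a current interpreter; it is not the plugin's code), using the file name from the report. py2_style_encode is a hypothetical name.

```python
# Sketch only: emulates Python 2's implicit ASCII decode, written for Python 3.
data = "m\u00e9s".encode("utf-8")  # the symlink target as bytes: b'm\xc3\xa9s'


def py2_style_encode(raw):
    """Mimic Python 2's str.encode('utf8'): implicit ASCII decode, then encode."""
    return raw.decode("ascii").encode("utf8")


try:
    py2_style_encode(data)
    message = None
except UnicodeDecodeError as exc:
    # The ASCII decode fails on the first byte of the two-byte "é" sequence.
    message = str(exc)

print(message)
```

The printed message matches the last line of the traceback above: the failure happens at byte 0xc3, position 1, before any UTF-8 encoding is attempted.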

Daniel Clemente (n142857) wrote :

Now that bug 272444 is fixed, the change in comment 1 can be applied. I attach it as a patch. With it I could correctly import a repository that didn't work before.

Daniel Clemente (n142857) wrote :

Could the patch please be checked in?

Daniel Clemente (n142857) wrote :

While the previous patch worked before, now a decode("utf8") is needed in order to pass CHKInventory._entry_to_bytes a unicode object, not a str.

I attach an updated patch ready to check in.
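
The idea of decoding before handing the value on can be sketched as follows. to_unicode is a hypothetical helper, not a function from bzr or bzr-fastimport, and the snippet is written for Python 3, where the Python 2 unicode type is called str.

```python
def to_unicode(value, encoding="utf-8"):
    """Return text for a path or symlink target that may arrive as bytes.

    Hypothetical helper for illustration; not part of bzr-fastimport.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value  # already text, leave it untouched


target = to_unicode(b"m\xc3\xa9s")
print(target == "m\u00e9s")
```

Accepting both bytes and text keeps the helper safe to call at several places in the import path, which matters if the same value can pass through more than one handler.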

To test this bug you can use this line:

cd /tmp; rm -rf tres enbzr; mkdir tres; cd tres; git init; touch més; ln -s més prova; git add més prova; git commit -a -m "link to a file with a name in utf-8"; git fast-export --all >expo; mkdir enbzr; cd enbzr; bzr init; bzr fast-import ../expo

Daniel Clemente (n142857) wrote :

With a larger repository I found another case that didn't work. Corrected in this new patch.

Daniel Clemente (n142857) wrote :

It seems there are more cases where a decode() may be needed, in particular with rename_item. I attach a new patch, but a new branch would be better since there may be many other changes.
I don't know if adding decode("utf-8") is the correct approach. With this patch (v4) I could get further in the conversion of a large branch.

Daniel Clemente (n142857) wrote :

Sorry for so many patches; I should use a branch. But with this one (v5) I got no Unicode errors exporting the biggest branch I have (3873 revisions, with many error-prone names). It stopped at another error (bug #458260, also about file names).

Ian Clatworthy (ian-clatworthy) wrote :

Thanks Daniel. I'm looking forward to getting this sorted out. BTW, I tried this:

mkdir tres; cd tres; bzr init; touch més; ln -s més prova; bzr add més prova; bzr commit -m "link to a file with a name in utf-8"

and bzr appears to do the right thing. Running "bzr fast-export ." on the resulting branch falls over though. I suspect it needs a few tweaks like the ones you've made on the import side.

So altogether, we need to get several things working:

1. import with unicode filenames and symlinks
2. export with unicode filenames and symlinks
3. "bzr fast-import-filter in.fi > out.fi" needs to produce an out.fi equivalent to the in.fi.

Could you put together a branch, apply your patch and push it to Launchpad? We can then work through these issues, add some tests and merge your fixes.

Daniel Clemente (n142857) wrote :

I created a branch at lp:~n142857/bzr-fastimport/unicode-symlinks
It covers the three points you asked for: import and export work (I tried simple test cases for both and a complex test case for import), and the filter is neutral. As a plus, the current fast-import tests still pass.

About the third point, fast-import-filter, I should mention two differences which I don't think are relevant, but just in case:
1. git input file uses 100644 as a mode, but bzr exports 644
2. bzr produces one more blob than git in a "mv" operation. I can send a diff.
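
The first difference is only a matter of spelling: the git fast-import stream format accepts both 100644 and 644 for a regular file (and 100755 or 755 for an executable one). A minimal sketch of normalizing the mode before comparing two streams follows. normalize_mode is a hypothetical helper, and it naively rewrites any line that looks like a filemodify command, including such lines inside blob data.

```python
def normalize_mode(line):
    """Rewrite 'M 644 ...' as 'M 100644 ...' (and 755 as 100755).

    Hypothetical helper for illustration; other lines are returned unchanged.
    """
    # Split into at most 4 fields so that a path containing spaces stays whole.
    parts = line.split(b" ", 3)
    if len(parts) == 4 and parts[0] == b"M" and parts[1] in (b"644", b"755"):
        parts[1] = b"100" + parts[1]
    return b" ".join(parts)


git_line = b"M 100644 :1 m\xc3\xa9s"  # as written by git fast-export
bzr_line = b"M 644 :1 m\xc3\xa9s"     # as written by bzr, per the note above
print(normalize_mode(git_line) == normalize_mode(bzr_line))
```

Working on bytes rather than decoded text avoids introducing a second encoding question into the comparison itself.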

I used this script to do more complex testing with symlinks:
 cd /n; rm -rf quatre enbzr; mkdir quatre; cd quatre; git init; touch més; ln -s més prova; git add més prova; git commit -a -m "link to a file with a name in utf-8"; cp -l prova prova2; git add prova2; git commit -a -m "copied symlink"; git mv prova provab; git commit -a -m "moved symlink"; rm provab; touch més2; ln -s més2 provab; git commit -a -m "modified symlink destination"; git rm provab; git commit -a -m "deleted symlink"; git fast-export --all >expo; mkdir enbzr; cd enbzr; bzr init; bzr fast-import ../expo

But of course we need better tests.

Jelmer Vernooij (jelmer) wrote :

Thanks, merged (with some tweaks). Sorry it took so long!

Changed in bzr-fastimport:
  status: New → Fix Committed
  importance: Undecided → Medium
  assignee: nobody → Daniel Clemente (n142857)

Jelmer Vernooij (jelmer) changed the summary:
  - Symbolic links to files with names in UTF-8 (Unicode)
  + Handling of names in UTF-8 (Unicode)

Jelmer Vernooij (jelmer) changed in bzr-fastimport:
  status: Fix Committed → Fix Released
  milestone: none → 0.10.0