We should support utf8-invalid filenames

Bug #368626 reported by Facundo Batista
This bug affects 62 people
Affects | Status | Importance | Assigned to | Milestone
Ubuntu One Client | Fix Released | High | Ubuntu One Foundations+ team |
ubuntuone-client (Ubuntu) | Fix Released | Medium | Ubuntu One Foundations+ team |

Bug Description

If the user creates a file whose name is not valid UTF-8, we should ignore it (and alert the user to the problem so they can fix it).

You can find invalid filenames using the following script: http://people.canonical.com/~roman.yepishev/ubuntuone-scripts/utf8-filename-check.py
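
For readers who cannot reach the script, the check it performs can be sketched roughly as follows. This is not the linked script, only a minimal illustration under the assumption that names are read from disk as raw bytes; the function names is_valid_utf8 and find_invalid_utf8_names are made up for this example.

```python
import os


def is_valid_utf8(name):
    """Return True if the raw byte string `name` decodes cleanly as UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_invalid_utf8_names(root):
    """Collect (directory, name) byte pairs whose name is not valid UTF-8."""
    invalid = []
    # Walking a bytes path makes os.walk yield raw byte names, so nothing
    # is decoded (or silently repaired) before we get to check it.
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root)):
        for name in dirnames + filenames:
            if not is_valid_utf8(name):
                invalid.append((dirpath, name))
    return invalid
```

Calling find_invalid_utf8_names on the Ubuntu One directory would return the offending entries as bytes, which can then be printed with repr() and renamed by hand.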


John Lenton (chipaca)
Changed in ubunet:
assignee: nobody → Lucio Torre (lucio.torre)
status: Confirmed → Triaged
tags: added: foundations+
John Lenton (chipaca)
affects: ubunet → ubuntuone-client
visibility: private → public
Paul Sladen (sladen) wrote :

You probably want to stick with UTF-8 in ubunet and simply validate that any client software submissions to the database (via u1-storage-protocol, or updown/webui) *are* valid UTF-8, and leave it up to the client applications to perform any double encoding.

Making any changes to let badly formatted strings in now will likely cause trouble later.

tags: added: facundo-lucid
tags: added: chicharra-lucid-problems
removed: facundo-lucid
Facundo Batista (facundo) wrote :

Because of the problem in #481409, we understood that we need to validate utf8 as early as possible (in _InotifyProcessor and LocalRescan), so that no later point has to handle wrong strings (in that bug, FSM had the bad path, and that caused a dbus message to crash).

tags: added: u1-lucid
fghoche (fg-launchpad) wrote :

Hello,
After several crashes, I have filed the following bugs: Bug #502717, Bug #502900, Bug #503300.
The first one has been marked as duplicate to this one.
I presume the other ones would be duplicates as well?
Is there any solution in sight for this problem?
Or else, is there a tool to help spot those files so that I can change their names?
I have to sync data between Linux, Windows and Mac computers, and I do not always control the naming. I have been correcting file names when they showed up in Nautilus or when an app couldn't load them. The duplicates after sync are another issue I have to solve. But that is another story...
I have subscribed to the 50 GB U1 program but have so far managed to upload only 680 MB...
TIA for any help

Roman Yepishev (rye) wrote :

Hi fghoche,
You may try the following script:
http://ubuntuone-client-diagnose.googlecode.com/svn/trunk/ubuntuone-client-diagnose.py

It should report all invalid file/directory names in your Ubuntu One directory if it finds any error related to invalid utf-8 names in the Ubuntu One logs.

fghoche (fg-launchpad) wrote :

Thanks for your reply Roman.
I have downloaded the script and run it.
I have a lot more dirs and files with problems than I thought.
It looks like none of the Windows accented chars in names are in UTF-8. Many of these are hard-coded in the French version of Windows I am using.
Is there a simple way to solve this?
I mean, I can go through all those names manually in my Ubuntu netbook and change the accented chars, but then I will have a problem to sync with my Windows boxes...
Any hints?

Roman Yepishev (rye)
tags: added: rye-diag
Guillermo Gonzalez (verterok) wrote :

Hi,

Bazaar faces a similar (probably more complex) issue regarding encoding, see Bug #77657

Changed in ubuntuone-client:
status: Triaged → Fix Committed
assignee: Lucio Torre (lucio.torre) → nobody
assignee: nobody → Bongcaivang (bongcaivang)
Facundo Batista (facundo) wrote :

There are three cases to address:

a) The user has a filename that SD cannot convert to Unicode (decoding it using the user's filesystem encoding).

b) SD receives from the server a Unicode filename that it cannot convert to bytes (encoding it using the user's filesystem encoding).

c) SD receives from the server a filename that it cannot save locally because of filesystem restrictions.

In case a), we convert it replacing "high" bytes to create a name that will not overlap other similar names.

Example:

    Client 1 (using utf8): "Hola \xff"
    Server: u'Hola %FF' ("Hola %FF")
    Client 2 (using utf8) "Hola %FF" ("Hola %FF")

In case b), we encode to bytes safely (using UTF-8) and then apply the same conversion as before.

Example:

    Client 1 (using utf8): "Pi: \xcf\x80"
    Server: u"Pi: \u03c0" ("Pi: π")
    Client 2 (using latin1) "Pi: %CF%80"

In case c), we translate it using a local table that will depend on the Ubuntu One client prepared for that filesystem.

Example:

    Client 1 (ext3 filesystem): "*.txt"
    Server: u"*.txt"
    Client 2 (ntfs filesystem) "star.txt"

All cases are addressed in today's Unicode boundary, where locally (filesystem and metadata) we have bytes, and in the protocol and server we use Unicode.

The translations are not reversible, and are done once: if a translation was needed, a "server_name" field will be filled in the local metadata with the server-side name, and the path will have the local names (no matter which one is the original).

Code that compares server and local names (e.g. Sync.merge_directory()) needs to use this server_name when available.
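
The three cases above could be sketched along these lines. This is only an illustration of the idea, not SD code: the function names are hypothetical, and the only table entry taken from the examples is "*" becoming "star".

```python
def escape_high_bytes(raw):
    """Replace every byte >= 0x80 with %XX, leaving ASCII bytes alone."""
    return "".join(chr(b) if b < 0x80 else "%{:02X}".format(b) for b in raw)


def local_to_server_name(raw, fs_encoding):
    """Case a): bytes read from disk -> text sent to the server."""
    try:
        return raw.decode(fs_encoding)
    except UnicodeDecodeError:
        return escape_high_bytes(raw)


def server_to_local_name(name, fs_encoding):
    """Case b): text received from the server -> bytes written to disk."""
    try:
        return name.encode(fs_encoding)
    except UnicodeEncodeError:
        # Encode safely as UTF-8 first, then apply the same escaping.
        return escape_high_bytes(name.encode("utf-8")).encode("ascii")


# Case c): assumed table; only "*" -> "star" comes from the example above.
NTFS_TABLE = {"*": "star"}


def apply_fs_table(name, table):
    """Replace each character the target filesystem forbids."""
    return "".join(table.get(ch, ch) for ch in name)
```

Note that this sketch does not escape a literal "%" already present in a name, so it does not by itself guarantee that translated names never collide with existing ones; a real implementation would have to deal with that.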

Changed in ubuntuone-client:
status: Fix Committed → Triaged
assignee: Bongcaivang (bongcaivang) → Facundo Batista (facundo)
Roman Yepishev (rye) wrote :

Attaching the script to find broken UTF-8 filenames.

Changed in ubuntuone-client:
assignee: Facundo Batista (facundo) → Ubuntu One Foundations+ team (ubuntuone-foundations+)
Natalia Bidart (nataliabidart) wrote :

I think that this solution is complex to implement (it's very clever, but complex at coding level), and I'm not sure it's worth it given the number of users who may be suffering from this issue.

What worries me the most is:

 * the translation table/function needs a lot of care, design and testing; it can be very easy to make a tiny mistake. How long will it take us to stabilize it?

 * the end users will start reporting that "the file names were messed up" by Ubuntu One. Usually, end users are not aware of encodings, and much less of the possible encoding issues, and if they see the same file with a different name they'll report that as a bug. Likewise, they may think that a file disappeared, when actually the file was renamed.

Roman Yepishev (rye) wrote :

I'd rather have my filenames left as is.
I believe some script built to perform the "migration" from old encoding to new utf-8 might be a better alternative.
I have some filenames living here from koi8-r era of my linux installation and doing iconv-things is not user-friendly.
Maybe a cmdline app and GUI for it for those that have no idea why some files they have are invalid.

But the first thing to do in this case is to notify the user that not all of his files can be synced. Skipping that silently/writing a warning to log only will cause more bug reports about incomplete syncing.

Natalia Bidart (nataliabidart) wrote :

I agree with Roman that we should notify the end user.

Roman Yepishev (rye) wrote :

If we introduce some sort of dbus signal - "Hey, I found a bad file, won't sync. No-no-no!" then that would open the possibility to create some diag tools that would notify the end user in a slightly better manner (even in Lucid, with custom scripts).

invalid utf-8 file name howto:

$ touch `echo файл.txt | iconv -f utf-8 -t koi8-r`

"файл" means file in Russian FWIW.

Roman Yepishev (rye)
description: updated
Facundo Batista (facundo) wrote :

After latest comments we discussed it again and reached the following conclusion: these files or directories will be ignored.

If a filesystem event with a non-utf8 name is received, the following actions will be taken:

- Send a dbus signal to alert that it is being ignored with the following info:

      - dirpath (unicode): the directory where the file or dir is located
      - ignored_path (bytes): the ignored file or dir name

- Log this ignore in a specific log file.

- Stop further event processing for this file/dir.

This will allow us to behave better for Lucid, and to revisit this decision later, once more variables are settled.
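
As a rough sketch of those three actions (not the actual syncdaemon code): the function name filter_event and the logger name are hypothetical, and emit_signal is a stand-in callback for whatever ends up sending the dbus signal.

```python
import logging

# Hypothetical logger standing in for the "specific log file".
invalid_names_log = logging.getLogger("invalid-names")


def filter_event(dirpath, name, emit_signal, log=invalid_names_log):
    """Return the decoded name, or None if the event must be dropped.

    dirpath is text, name is the raw bytes from the filesystem event.
    """
    try:
        return name.decode("utf-8")
    except UnicodeDecodeError:
        # 1. alert listeners, with the two fields described above
        emit_signal(dirpath=dirpath, ignored_path=name)
        # 2. record the ignored entry
        log.warning("Ignoring invalid name %r in %r", name, dirpath)
        # 3. tell the caller to stop processing this file/dir
        return None
```

A caller would simply return early when filter_event gives back None, so the bad name never reaches the rest of the event pipeline.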

tags: added: package
Changed in ubuntuone-client:
milestone: none → lucid-beta-2
Changed in ubuntuone-client (Ubuntu):
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Ubuntu One Foundations+ team (ubuntuone-foundations+)
milestone: none → ubuntu-10.04-beta-2
Changed in ubuntuone-client:
status: Triaged → In Progress
Changed in ubuntuone-client (Ubuntu):
status: Triaged → In Progress
Changed in ubuntuone-client (Ubuntu):
status: In Progress → Fix Committed
Changed in ubuntuone-client:
status: In Progress → Fix Committed
Changed in ubuntuone-client (Ubuntu):
status: Fix Committed → Fix Released
tags: removed: package
Changed in ubuntuone-client:
status: Fix Committed → Fix Released
Matteo Settenvini (tchernobog) wrote :

This bug is back for me in Natty, maybe it appeared again after the transition to Python 2.7.
I have a folder in my Ubuntu One space called "Blekinge Tekniska Högskola" (please note the umlaut).

Now Ubuntu One is not synchronizing at all, because it bails out with:

matteo@orchid:~/.cache$ /usr/lib/ubuntuone-client/ubuntuone-syncdaemon
/usr/lib/pymodules/python2.7/ubuntuone/syncdaemon/filesystem_manager.py:581: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if path[:len_base] == base_path and sep not in path[len_base:]:
Unhandled error in Deferred:
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/ubuntuone/syncdaemon/local_rescan.py", line 337, in _scan_tree
    d.addCallbacks(self._scan_one_dir)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 249, in addCallbacks
    self._runCallbacks()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 441, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "/usr/lib/pymodules/python2.7/ubuntuone/syncdaemon/local_rescan.py", line 679, in _scan_one_dir
    d = defer.execute(scan)
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 96, in execute
    result = callable(*args, **kw)
  File "/usr/lib/pymodules/python2.7/ubuntuone/syncdaemon/local_rescan.py", line 639, in scan
    share)
  File "/usr/lib/pymodules/python2.7/ubuntuone/syncdaemon/local_rescan.py", line 528, in _compare
    objs = self.fsm.get_mdobjs_by_share_id(share.volume_id, fullname)
  File "/usr/lib/pymodules/python2.7/ubuntuone/syncdaemon/filesystem_manager.py", line 567, in get_mdobjs_by_share_id
    if path.startswith(base_path):
exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 43: ordinal not in range(128)

My guess is that some variable which should be unicode-aware (base_path, by the look of it?) is in fact passed a bytestring.
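
For context on that 'ascii' codec message: in Python 2, mixing a byte string with a unicode object in a call such as startswith() makes the interpreter decode the byte string as ASCII behind the scenes. The sketch below makes that hidden step explicit in Python 3 syntax; the folder name is the one from this comment, while the function name is made up for the illustration.

```python
def implicit_py2_coercion(byte_path):
    """What Python 2 did silently when a str met a unicode object."""
    return byte_path.decode("ascii")


# The folder name as UTF-8 bytes, the form it has on disk.
path = "Blekinge Tekniska Högskola".encode("utf-8")

try:
    implicit_py2_coercion(path)
    error = None
except UnicodeDecodeError as exc:
    # The first byte of the encoded umlaut (0xc3) is outside ASCII.
    error = exc
```

Here error.reason is "ordinal not in range(128)" and the offending byte is 0xc3, matching the traceback; the position differs because the real path is longer.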
I tried to debug this a little, but since I am not a Python developer I stopped after hitting that line 567 in filesystem_manager.py with a breakpoint.

I have other filenames with UTF-8 characters, so renaming all of them would be quite a pain. Plus, it is supposed to work, right?
I am using the it_IT.UTF8 locale, if that helps.

Facundo Batista (facundo) wrote :

Matteo: No, you're suffering from bug #696901.

Regards,

Matteo Settenvini (tchernobog) wrote :

What is bug #696901 about? Is there a workaround? It says I cannot access that page; I guess it is a private bug.

Facundo Batista (facundo) wrote :

Matteo, yes, the OP marked it as private, sorry.

It's: "LR is putting non-utf8 paths into SD"

Yes, there's a workaround: try to find a folder or file on your disk whose path is not valid UTF-8.

I can help you if you post the logs in debug mode, or come to IRC, to #ubuntuone in FreeNode, for more interactive help.

Thanks

Facundo Batista (facundo) wrote :

So, we followed up on this over IRC; the issue was a problem in the metadata. It's fixed now.

hanzz (hans-prueller) wrote :

I am STILL having this bug on NATTY:
===

hansp@kodos:~$ u1sdtool --waiting-content

Oops, an error ocurred:
Traceback (most recent call last):
Failure: dbus.exceptions.DBusException: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
hansp@kodos:~$ u1sdtool --waiting-content

Oops, an error ocurred:
Traceback (most recent call last):
Failure: dbus.exceptions.DBusException: org.freedesktop.DBus.Python.UnicodeError: Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.6/dbus/service.py", line 745, in _message_cb
    _method_reply_return(connection, message, method_name, signature, *retval)
  File "/usr/lib/pymodules/python2.6/dbus/service.py", line 252, in _method_reply_return
    reply.append(signature=signature, *retval)
UnicodeError: String parameters to be sent over D-Bus must be valid UTF-8

hanzz (hans-prueller) wrote :

... forgot to mention that somehow it seems that this bug STOPS ubuntu-one from SYNCING.

It has the status "working on content" for weeks now - even if I do not turn off my computer. this really breaks everything!

Roman Yepishev (rye) wrote :

hanzz, could you please check which version of the ubuntuone client you are running?
Additionally, could you please try running u1sdtool --waiting (in new syncdaemons the content/meta queues are merged).

hanzz (hans-prueller) wrote :

hi,

i have checked it using synaptic:

ubuntuone-client version = 1.4.6-0ubuntu2

running "u1sdtool --waiting" fails --obviously this is not supported.

do I have a wrong version of ubuntuone installed ?

Roman Yepishev (rye) wrote :

hanzz, yes, it looks like you have an outdated installation of ubuntuone if you are using current Natty.
For Natty Narwhal 11.04 the latest released client version is 1.5.8-0ubuntu2

hanzz (hans-prueller) wrote :

Sorry, I was confused: I have experimented with Natty on another machine, but the machine that has the problem, with client version 1.4.6-0ubuntu2, runs 10.10 Maverick. Sorry about that.

Facundo Batista (facundo) wrote :

hanzz: I think that you're suffering from bug #561638 (note that it talks about --waiting-meta there, but surely --waiting-content and maybe --waiting are affected too).

hanzz (hans-prueller) wrote :

I'm not sure if this is affecting me.

that --waiting-metadata or --waiting fails is a minor issue, my MAIN issue is that it seems to break sync in general!

my laptop has been up and running for DAYS and sync does not complete - status is always "working on content", so I guess that the syncdaemon is also affected by this issue?

hanzz (hans-prueller) wrote :

meanwhile I upgraded to the latest ubuntuone client package from the NIGHTLY BUILDS as described here:

https://wiki.ubuntu.com/UbuntuOne/FAQ/HowDoIInstallNightlies

but - I STILL have the problem :

hansp@kodos:~$ u1sdtool --waiting

Oops, an error ocurred:
Traceback (most recent call last):
Failure: dbus.exceptions.DBusException: org.freedesktop.DBus.Python.UnicodeError: Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.6/dbus/service.py", line 745, in _message_cb
    _method_reply_return(connection, message, method_name, signature, *retval)
  File "/usr/lib/pymodules/python2.6/dbus/service.py", line 252, in _method_reply_return
    reply.append(signature=signature, *retval)
UnicodeError: String parameters to be sent over D-Bus must be valid UTF-8

Facundo Batista (facundo) wrote :

hanzz, you're commenting on a bug that is about a different issue; the problem you show in your comment is covered in bug #561638. If you have another problem (like the broken sync you mention), please open a new bug and attach logs in debug mode:

1. stop the syncdaemon client ("u1sdtool --quit") and be sure it's fully stopped ("ps -eaf | grep ubuntuone-client" should give you nothing).

2. put a file named syncdaemon.conf in your $HOME/.config/ubuntuone directory with the following information:

[logging]
level = DEBUG

3. restart the client.

4. attach the logs here: just zip your $HOME/.cache/ubuntuone/log/ folder and attach the zip.

Continuing to comment on this bug is not the best way to get attention.

Thanks for your time and help!

Martin Pool (mbp) wrote :

Hans, and others still seeing these symptoms or being redirected to this bug on natty: you may be affected by bug 807671. Read or subscribe to that instead. Again the answer is to run the script and fix up any invalid names.
