Chinese file names in Zip Archives compressed on Windows cannot be extracted correctly

Bug #371167 reported by Aron Xu
This bug affects 13 people
Affects               Status     Importance  Assigned to  Milestone
File Roller           Unknown    Unknown
KDE Utilities         Invalid    Undecided   Unassigned
file-roller (Ubuntu)  Triaged    Low         Unassigned
kdeutils (Ubuntu)     Invalid    Undecided   Unassigned
p7zip (Ubuntu)        Confirmed  Undecided   Unassigned
unzip (Ubuntu)        Confirmed  Undecided   Unassigned

Bug Description

Hi,
I am a user from China. While a friend and I were preparing to help more Chinese users migrate to Ubuntu or Kubuntu, we found that file-roller and Ark do not work the way we need them to.
We often have files that were compressed on Windows and have Chinese file names. When I tried to decompress one of them, Ark could not determine the correct encoding of the file names. On Windows the names are encoded as CP936 (GBK), but Ark treats them as UTF-8 by default, so the extracted files show up with garbled names instead of the right ones. Looking into why, I found that I had both p7zip and unzip installed, and that Ark calls p7zip rather than unzip to decompress .zip archives. I then purged the p7zip package and tried again, but the problem remained.
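
A minimal Python sketch of the mismatch described above, assuming the name was stored as GBK (CP936) bytes. The variable names are illustrative only; this is not how Ark or unzip are implemented.

```python
# File name as a Simplified Chinese Windows system would store it (assumed GBK bytes).
original = "测试1.txt"
raw = original.encode("gbk")

# A tool that assumes UTF-8 cannot decode these bytes cleanly;
# invalid sequences are replaced with U+FFFD here instead of raising.
as_utf8 = raw.decode("utf-8", errors="replace")

# Decoding with the encoding that was actually used recovers the name.
recovered = raw.decode("gbk")

print(repr(as_utf8))
print(recovered)
```

The archive itself does not say which legacy encoding was used, so the extracting tool has to be told or has to guess.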

Thanks for taking the time to look at this problem.
Some more information about my system and software is listed here:

This problem has been reported since Jaunty and still persists in Karmic.

Tags: cjk i18n
fujianwzh (wzh)
Changed in ubuntu:
assignee: nobody → fujianwzh (fujianwzh)
status: New → Confirmed
fujianwzh (wzh)
Changed in ubuntu:
assignee: fujianwzh (fujianwzh) → nobody
Revision history for this message
Yannis Tsop (ogiannhs) wrote :

The same problem exists for other encodings too! I have it with Greek file names, which cannot be extracted at all, although file-roller does extract them, creating files with weird names.

Aron Xu (happyaron)
affects: ubuntu → kdeutils (Ubuntu)
Changed in kdeutils (Ubuntu):
status: Confirmed → New
Aron Xu (happyaron)
tags: added: cjk i18n
Revision history for this message
Dominik Stadler (dominik-stadler) wrote :

Can you please attach one of the compressed files that cause problems?

Aron Xu (happyaron)
summary: - Ark cannot decompress Zip Archives with Chinese file name that
- compressed on Windows
+ Chinese file names in Zip Archives compressed on Windows cannot be
+ extracted correctly
description: updated
Revision history for this message
Aron Xu (happyaron) wrote :

This attachment can be used to reproduce the problem. 测试.zip was compressed on Windows XP in a Simplified Chinese environment and contains two empty files named 测试1.txt and 测试2.txt. When you decompress it with Ark or File-roller, you won't get the correct file names.

Revision history for this message
Dominik Stadler (dominik-stadler) wrote :

I see the same thing here on Karmic 9.10 using the attached file.

/tmp$ unzip *.zip
Archive: 测试.zip
 extracting: ??-?1.txt
 extracting: ??-?2.txt
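
One way to inspect such an archive without going through unzip is Python's zipfile module. The sketch below assumes the archive stores names in a legacy code page without the UTF-8 flag (general purpose bit 11). In that case zipfile decodes the names as cp437, so encoding back to cp437 and decoding with the real encoding recovers them. The encoding has to be supplied by the caller (GBK is only an assumption for this archive), and `legacy_name` and `list_names` are hypothetical helper names.

```python
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: name is stored as UTF-8


def legacy_name(info: zipfile.ZipInfo, encoding: str) -> str:
    """Return the entry name, re-decoded from the assumed legacy encoding."""
    if info.flag_bits & UTF8_FLAG:
        # The archive declares UTF-8, so zipfile already decoded it correctly.
        return info.filename
    # zipfile decoded the raw bytes as cp437; undo that and decode properly.
    return info.filename.encode("cp437").decode(encoding)


def list_names(path: str, encoding: str) -> list:
    """List entry names of a zip file, assuming `encoding` for legacy names."""
    with zipfile.ZipFile(path) as zf:
        return [legacy_name(info, encoding) for info in zf.infolist()]
```

For the attachment in this report, the call would be `list_names("测试.zip", "gbk")`, provided the GBK assumption holds.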

Changed in unzip (Ubuntu):
status: New → Confirmed
Revision history for this message
Pedro Villavicencio (pedro) wrote :

For the file-roller task, could somebody who has the issue report it at bugzilla.gnome.org? Thanks.

Changed in file-roller (Ubuntu):
importance: Undecided → Low
Revision history for this message
Molnár Gábor (csirkus) wrote :

This file was compressed on a Hungarian-localized Windows with the iso-8859-2 character encoding, and it cannot be extracted using file-roller. The error message is "caution: filename not matched: 01_KB_eln\?k.pdf". The name of the file should be 01_KB_elnök.pdf (with the accented letter ö).

Changed in file-roller (Ubuntu):
status: New → Triaged
Revision history for this message
Frederik Elwert (frederik-elwert) wrote :

As stated in the GNOME bug report, installing p7zip-full fixes the problem. So this seems to be a problem of unzip not handling these characters correctly.

Revision history for this message
Dimitrios Dalagiorgos (dimndal) wrote :

I have a similar problem with filenames in Greek (Windows-1253)

Revision history for this message
Molnár Gábor (csirkus) wrote :

I reported the problem on the developers' website (http://www.info-zip.org/, according to the 'unzip' manual page) through a contact form (they don't have a bug tracker) and suggested that they contact users through Launchpad or GNOME Bugzilla. I hope they will fix this.

Revision history for this message
eumetaxas (eumetaxas) wrote :

I have a similar problem with Google Docs. When I download several docs in zip format, the non-English characters do not appear at all and the file names are ......................ods or .....................odt.
Is this a general issue of the zip format?
Thank you.

Revision history for this message
Jiahua Huang (huangjiahua) wrote :

This patch will fix it

Revision history for this message
Aron Xu (happyaron) wrote :

Can anybody have a look at this bug? Maybe we can use a patch (perhaps the one above, though it is untested) to fix this problem. For non-English users this bug is quite annoying. Thanks.

Revision history for this message
Harald Sitter (apachelogger) wrote :

Ark uses unzip and related tools from the unzip package, so I am closing the kdeutils tasks as invalid: this needs to be resolved in unzip for it to be resolved in Ark as well.

Changed in kdeutils (Ubuntu):
status: New → Invalid
Changed in kdeutils:
importance: Unknown → Undecided
status: Unknown → New
status: New → Invalid
Revision history for this message
Jose M. Albarrán (yomismo-jmalbarran) wrote :

Same problem with the Spanish charset. I reported this bug a couple of years ago, but now I cannot find my initial report. Maybe it is a regression or a configuration problem.

Revision history for this message
Еггог (sergey-nr) wrote :

Same problem with the Russian charset. In Ubuntu 8.04 this bug was not present.

Revision history for this message
Dmitry Agafonov (dmitry-agafonov) wrote :

I guess we should mark this bug as a duplicate of https://bugs.launchpad.net/ubuntu/+source/unzip/+bug/477755

Revision history for this message
Andreas Heinlein (aheinlein) wrote :

This bug is not quite a duplicate, at least not for 7zip. In reply to comment #7: installing p7zip-full does not fully fix the problem. It does, however, change the behaviour of file-roller. Without p7zip-full, file-roller uses unzip and cannot extract files with special characters in their names. With p7zip-full installed, it evidently uses 7z and still shows weird characters in most cases, but the files can be extracted. It also does not solve the problem in the other direction (opening archives on Windows that were created with file-roller), but it changes the behaviour there too.

"In most cases" means it also depends on the Windows packer used. I did some cross-tests and found really weird behaviour. Files created with Filzip work; files created with Winzip 14 or 7zip for Windows (sic!) do not. Interestingly enough, this does not work the other way round, i.e. Filzip for Windows cannot correctly handle archives created with p7zip for Linux. Also, 7zip for Windows can handle file-roller archives created with info-zip, but not those created with p7zip. 7zip for Windows also cannot handle Filzip archives, but the other way round works. p7zip archives can be handled by Winzip 14, but again not the other way round.

Summary: There is currently no pair of Windows<->Linux programs I know of in which each can handle special characters in archives created by the other. The real blocker is file-roller not being able to open or extract those files at all, which can be solved by installing p7zip-full. All other programs can open these files, though with garbled file names. Anyway, this needs to be fixed with some kind of encoding detection or guessing.

Revision history for this message
Alexander Melnichuk (ama-land) wrote :

Same thing here when unzipping archives with Russian file names.
It seems that unzip converts file names from cp850 to cp1252 by default, and this conversion ruins every other encoding. I'm trying to unzip a Windows-created zip archive with Russian file names (cp866). To restore the correct file names, I currently have to use the following set of commands as a workaround:

unzip filename.zip
convmv --nosmart --notest -f cp1252 -t cp850 *
convmv --nosmart --notest -f cp866 -t utf8 *

And this works.
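
The same two-step round trip can be sketched in Python on plain byte strings. This assumes the names were mangled exactly as described above (raw bytes read as cp850 and written out as cp1252) and that the real encoding is cp866. `mangle` and `restore` are hypothetical names used only for illustration.

```python
def mangle(raw_name: bytes) -> bytes:
    """Simulate the described conversion: bytes read as cp850, written as cp1252."""
    return raw_name.decode("cp850").encode("cp1252")


def restore(mangled: bytes, real_encoding: str = "cp866") -> str:
    """Undo the cp850 -> cp1252 conversion, then decode with the real encoding."""
    # Step 1 (like convmv -f cp1252 -t cp850): get the original bytes back.
    original_bytes = mangled.decode("cp1252").encode("cp850")
    # Step 2 (like convmv -f cp866 -t utf8): interpret them as the real encoding.
    return original_bytes.decode(real_encoding)


on_disk = mangle("тест.txt".encode("cp866"))
print(restore(on_disk))
```

The round trip only works when every character produced by the first conversion also exists in cp1252; names that hit unmapped characters would need different handling.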

It would be much handier if unzip autodetected the proper conversion based on the system locale (i.e. cp866->utf8 for Russian, CP936->utf8 for Chinese, etc.), or if an unzip command-line parameter could override the default cp850->cp1252 conversion in case autodetection fails.
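
A rough Python sketch of what such locale-based selection could look like. The mapping table is hypothetical and only covers encodings mentioned in this report; it is not taken from unzip or any existing tool, and `guess_encoding` is an illustrative name.

```python
import locale

# Hypothetical mapping from locale (or language) to the legacy encoding
# that Windows-created archives are likely to use. Illustrative, not complete.
LEGACY_ENCODINGS = {
    "ru": "cp866",
    "zh_CN": "gbk",
    "ko": "cp949",
    "el": "cp1253",
    "hu": "iso8859_2",
}


def guess_encoding(locale_name, default="cp437"):
    """Pick a legacy encoding for a locale name such as 'ru_RU.UTF-8'."""
    if not locale_name:
        return default
    lang_region = locale_name.split(".")[0]  # "ru_RU.UTF-8" -> "ru_RU"
    language = lang_region.split("_")[0]     # "ru_RU" -> "ru"
    return LEGACY_ENCODINGS.get(lang_region) or LEGACY_ENCODINGS.get(language, default)


# The current locale could be passed in like this; an explicit override
# would still be needed for archives from a different language environment.
print(guess_encoding(locale.getlocale()[0]))
```

A guess based on the locale fails for archives created under another language, which is why an explicit override remains useful.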

--------
Tried on Lucid Lynx 10.04 LTS, unzip 6.0-1build1, convmv 1.12
See an example zip with cp866 filenames attached.

Revision history for this message
Seunghoon Park (pclove1) wrote :

Same problem with Korean(CP949)
I hope that it will be fixed soon :)

Revision history for this message
Rolf Leggewie (r0lf) wrote :

Jiahua (comment 11), thank you for sharing that patch. Unfortunately, it only "fixes" the problem in the Chinese case and thus is not a real solution.

Andreas (comment 17), thank you for sharing your findings. Nonetheless, this is the same as bug 477755.

Marking as dupe. Thank you.

Revision history for this message
liew (junyee-619) wrote :

helo

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in p7zip (Ubuntu):
status: New → Confirmed