Localised Ubuntu start pages (8.10) have corrupted UTF-8 text

Bug #290494 reported by David Planella
This bug affects 8 people
Affects                     Status         Importance   Assigned to     Milestone
Ubuntu Website - OBSOLETE   Fix Released   High         Matthew Nuzum
ubuntu-docs (Ubuntu)        Invalid        Undecided    Unassigned

Bug Description

The generator scripts for the Ubuntu 8.10 start page corrupt the text if it contains non-ASCII characters.
To see the page for your language, visit
http://start.ubuntu.com/8.10/index.html.LL
where LL is the language code (for example, 'el', 'ru' and so on).

Although the PO files that the translators produced are encoded in UTF-8, the scripts that create the HTML pages mistakenly assume that the source encoding is ISO-8859-1 rather than UTF-8.
This corrupts the text of the pages.
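
The suspected failure can be reproduced in a few lines of Python. This is only a sketch of the hypothesis stated above, not the actual generator script: it assumes the tool decodes the UTF-8 bytes of the PO file as ISO-8859-1, and it uses the Catalan title from the old description as sample input.

```python
# Hypothetical illustration of the suspected bug; not taken from the real scripts.
title = "Pàgina inicial de l'Ubuntu"        # title from the Catalan translation

utf8_bytes = title.encode("utf-8")           # bytes as stored in the UTF-8 PO file
misread = utf8_bytes.decode("iso-8859-1")    # what a tool assuming ISO-8859-1 would see

# The single character "à" turns into two characters: "Ã" and a no-break space.
print(repr(misread))
```

If the misread string is then written out as UTF-8, the page shows two wrong characters in place of each accented one.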

The solution is to find where in the scripts the text gets corrupted. As soon as the problem is fixed, the Start pages will appear properly when you visit the page again.

Old description -----------

Binary package hint: ubuntu-docs

The submitted Catalan translation of the browser start page correctly spelled the title of the start page in UTF-8 format [1]:

"Pàgina inicial de l'Ubuntu"

However, the released start page went through some kind of conversion that turned the "à" character into an unreadable character, so the title is not displayed correctly (see attached screenshot).

Note: I am reporting this against ubuntu-docs because the browser start page translation used to be here. With the last-minute changes to the browser start page I do not know where it resides anymore. Please reassign if necessary.

[1] https://lists.ubuntu.com/archives/ubuntu-translators/2008-October/001837.html

Revision history for this message
David Planella (dpm) wrote :

Please find attached the original translation as submitted to the ubuntu-translators list.

Revision history for this message
Artem Popov (artfwo) wrote :

Confirming: the title (and contents) are broken on the Russian page as well.

Changed in ubuntu-docs:
status: New → Confirmed
Revision history for this message
Felipe Gil Castiñeira (xil) wrote :

The problem is in the "localize.sh" script. It seems that po2html does not handle the accents in the po file correctly. A workaround is to use HTML codes for special characters [1] in the .po file instead of the UTF-8 encoded characters (e.g. "P&aacute;xina" instead of "Páxina").

[1] http://webdesign.about.com/library/bl_htmlcodes.htm
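
A minimal sketch of that workaround in Python, assuming numeric character references are acceptable in place of the named HTML codes. The helper name `to_char_refs` is made up for this example and is not part of the Translate Toolkit or of localize.sh.

```python
def to_char_refs(text):
    """Replace every non-ASCII character with a numeric HTML character reference."""
    # The "xmlcharrefreplace" error handler emits &#NNN; for anything outside ASCII.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_char_refs("Páxina"))
```

Because the result is pure ASCII, it reads the same whether a later tool assumes UTF-8 or ISO-8859-1, which is why the workaround avoids the corruption. The cost is that the .po file becomes harder for translators to read.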

Revision history for this message
David Planella (dpm) wrote :

In our case the script generated a correct page, though. Only the published page does not seem to be encoded in UTF-8.

Revision history for this message
Artem Popov (artfwo) wrote :

Maybe HTML-Tidy produces this output? I ran it locally, and it looks like it does not detect UTF-8 automatically and converts international characters into unreadable text...

Revision history for this message
Matthew East (mdke) wrote :

This page is an online page and is not part of the ubuntu-docs package.

Changed in ubuntu-docs:
status: Confirmed → Invalid
Changed in ubuntu-website:
assignee: nobody → newz
importance: Undecided → High
Revision history for this message
Dávid Gábor Bodor (drag0nfi) wrote :

Affecting the Hungarian startpage too, but I guess you already know.

Revision history for this message
Fumihito YOSHIDA (hito) wrote :

Affecting the Japanese startpage too.

Revision history for this message
David Henningsson (diwic) wrote :

Confirmed for the Swedish start page. It looks like text that is already UTF-8-encoded is being converted to UTF-8 once more, as if it were in some other encoding, since every character > 127 takes up four bytes (checked with a hex editor).
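
The four-byte observation is consistent with double encoding, as the sketch below shows. It assumes the intermediate misreading is ISO-8859-1, which the report suggests but does not prove; the function names are made up for the illustration.

```python
def double_encode(text):
    """Encode to UTF-8, misread the bytes as ISO-8859-1, then encode to UTF-8 again."""
    return text.encode("utf-8").decode("iso-8859-1").encode("utf-8")


def undo_double_encoding(data):
    """Reverse double_encode; only meaningful for data that really was double-encoded."""
    return data.decode("utf-8").encode("iso-8859-1").decode("utf-8")


char = "ö"                       # as in the Swedish "Sök"
once = char.encode("utf-8")      # correct UTF-8: 2 bytes
twice = double_encode(char)      # double-encoded: 4 bytes

print(len(once), once.hex())
print(len(twice), twice.hex())
print(undo_double_encoding(twice))
```

Each byte of a multi-byte UTF-8 sequence is above 127, so the second encoding step turns every such byte into two bytes. Characters that need two bytes in UTF-8 therefore end up as four, and characters that need three bytes would end up as six.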

Revision history for this message
Matthew Nuzum (newz) wrote :

Working on a solution. Seems to be po2html causing the problem.

Changed in ubuntu-website:
status: New → Confirmed
Revision history for this message
vista killer (vistakiller) wrote :

Affecting the Greek startpage too

description: updated
Revision history for this message
Julian Alarcon (julian-alarcon) wrote :

Also the Spanish home page.

Revision history for this message
Gabor Kelemen (kelemeng) wrote :

Possible solution/workaround/whatever: https://lists.ubuntu.com/archives/ubuntu-translators/2008-October/001886.html
Could somebody confirm whether this is a viable way? No reply on the list yet :(. Is it just me, or is that problem really _that_ difficult?

Revision history for this message
Bruno (bruno666-666) wrote :

I've just sent a possible fix to the ubuntu-translators mailing list. HTH

Revision history for this message
Bruno (bruno666-666) wrote :

Here's the patch to apply to po2html.py

--- translate-toolkit-1.1.1/translate/convert/po2html.py.old 2008-11-05 17:18:17.000000000 +0100
+++ translate-toolkit-1.1.1/translate/convert/po2html.py 2008-11-05 17:18:50.000000000 +0100
@@ -81,7 +81,7 @@
                 htmlresult = htmlresult.replace(msgid, msgstr, 1)
         htmlresult = htmlresult.encode('utf-8')
         if tidy:
-            htmlresult = str(tidy.parseString(htmlresult))
+            htmlresult = str(tidy.parseString(htmlresult, **{'char_encoding': "utf8"}))
         return htmlresult

 def converthtml(inputfile, outputfile, templatefile, wrap=None, includefuzzy=False):

Revision history for this message
Felipe Gil Castiñeira (xil) wrote :

I can confirm that this patch works correctly (at least for the languages I speak).

Revision history for this message
Matthew Nuzum (newz) wrote :

A fix has been implemented, but a more long-term solution is needed. I will be working on this in the context of the ubuntu-website team; anyone interested in contributing to the solution is welcome and encouraged to join.

Changed in ubuntu-website:
status: Confirmed → Fix Released
Revision history for this message
Gabor Kelemen (kelemeng) wrote :

Strange, lots of languages (ja, ka, cs, bn) look fine now, but Hungarian and Russian do not, see: http://start.ubuntu.com/8.10/index.html.ru, http://start.ubuntu.com/8.10/index.html.hu.

Revision history for this message
Daniel Nylander (yeager) wrote :

Confirmed for Swedish.
Still waiting for the Search string fix (which is "Sök" in Swedish)

Revision history for this message
Gabor Kelemen (kelemeng) wrote :

Forget my previous comment, they are fine. Perhaps it was just my browser cache :(.
