sync_members crashes for UTF-8 real name

Bug #1202395 reported by Cedders
This bug affects 1 person
Affects      Status  Importance  Assigned to  Milestone
GNU Mailman  New     Undecided   Unassigned

Bug Description

This was reported on mailman-users <http://mail.python.org/pipermail/mailman-users/2007-October/058689.html>, and a fix was suggested by Mark Sapiro.

Steps to reproduce:
1) Create a text file encoded in UTF-8 including a line such as
Cédríc <email address hidden>
2) Use a list test-list ensuring <email address hidden> is not already a member of test-list
3) run sync_members --no-change --welcome-msg=no --goodbye-msg=no --notifyadmin=no -f testutf8.txt test-list

Expected results:
address is added with
Added : Cédríc <email address hidden>

Actual results:
  File "/usr/sbin/sync_members", line 259, in main
    s = email.Utils.formataddr((name, addr)).encode(enc, 'replace')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

Attached patch applies to 2.1 and 2.2 head.

Note that there is also a related issue with list_members -f, where safe() also encodes to 7-bit, resulting in
C??dr??c <email address hidden>
This is on a Debian 6.0.7 system with a UTF-8 locale.
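
The C??dr??c output is what an ASCII decode-then-encode round trip with the 'replace' error handler would produce. A minimal sketch, assuming that is roughly what safe() does (the helper name is hypothetical, the address is a placeholder, and the code is written for Python 3 so the byte handling is explicit):

```python
def ascii_with_replacement(raw_bytes):
    """Hypothetical stand-in for a 7-bit 'safe' conversion (not Mailman's code)."""
    # Every non-ASCII byte becomes U+FFFD on decode...
    text = raw_bytes.decode('ascii', 'replace')
    # ...and every U+FFFD becomes '?' on encode.
    return text.encode('ascii', 'replace')

# Placeholder address; the report hides the real one.
raw_line = 'Cédríc <user@example.com>'.encode('utf-8')
mangled = ascii_with_replacement(raw_line).decode('ascii')
print(mangled)  # C??dr??c <user@example.com>
```

Each accented letter is two bytes in UTF-8, which is why each one turns into two question marks.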

Cedders (cedric-gn) wrote :

Mark Sapiro (msapiro) wrote :

This is complicated. It is not clear that this is a bug, and if it is a bug, it is not clear that the bug is in sync_members.

The problem occurs in the statement

    s = email.Utils.formataddr((name, addr)).encode(enc, 'replace')

when name contains non-ascii. The first issue is that the job of email.Utils.formataddr() is to take a name and address pair and return a string (e.g. 'name <addr>') suitable for inclusion in a To:, From:, Cc:, or similar email message header. Headers are not allowed to contain non-ascii, so it could be argued that if name contains non-ascii, the result returned by email.Utils.formataddr() should be RFC 2047 encoded so that it doesn't contain non-ascii.
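
For illustration only: Python 3's email.utils.formataddr() does apply RFC 2047 encoding to a non-ASCII display name, unlike the Python 2 email.Utils version discussed here. A short sketch, using a placeholder address, shows what such a result looks like:

```python
from email.utils import formataddr

# Placeholder address; the report hides the real one.
header_value = formataddr(('Cédríc', 'user@example.com'), charset='utf-8')

# The result is pure ASCII: the display name is wrapped in an
# RFC 2047 encoded-word of the form =?utf-8?...?=
header_value.encode('ascii')  # would raise if any non-ASCII remained
print(header_value)
```

The encoded-word is valid in a header but not readable in a console report.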

Ignoring that, the next issue is that Python's default encoding is ascii regardless of locale. Thus, when we try to encode() the string returned by email.Utils.formataddr(), Python must first decode it and does this using the ascii codec which throws the exception. Removing the encode() as the suggested patch does avoids this, but is not, I think, the best way to fix this.
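
The implicit step can be spelled out. In Python 2, calling encode() on a byte string first decodes it with the default codec. The sketch below, written for Python 3 with a placeholder address, performs that decode explicitly and triggers the same error:

```python
# Bytes as they would arrive from a UTF-8 input file (placeholder address).
raw = 'Cédríc <user@example.com>'.encode('utf-8')

error_message = None
try:
    # Python 2 performs this ASCII decode implicitly before str.encode().
    raw.decode('ascii').encode('utf-8', 'replace')
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

The message matches the traceback in the report: byte 0xc3 at position 1 is the first byte of the UTF-8 encoding of é.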

I think the proper fix is to make your Python locale-aware by editing the /usr/lib/pythonv.v/site.py module and changing the first

    if 0:

in the definition of setencoding() to

    if 1:

This will not only fix this issue with sync_members, it will also fix the garbled output from list_members -f and probably other cases of non-ascii being replaced with '?' in the command line scripts.

Another way to do this is to add

import sys
sys.setdefaultencoding('utf-8')

to the sitecustomize.py module (/etc/pythonv.v/sitecustomize.py on Debian).

Cedders (cedric-gn) wrote :

Hi Mark

Thanks for the reply. By the way, it was you who suggested this approach, and I still think you were right back then!

Firstly, according to http://wiki.python.org/moin/DefaultEncoding, sys.getdefaultencoding() is pretty much deprecated and will be removed in Python 3.0 (as you say "Python's default encoding is ascii regardless of locale"). Secondly, I don't think the input to sync_members should be interpreted as a 7-bit message header with possibly RFC 2047 encoding. Thirdly, add_members does not have this problem. Fourthly, if you did escape the non-ASCII characters with base64 or quoted-printable at some point, then these would presumably show up in the command output (and possibly the web interface).

Finally, yes, modifying site.py as you describe does fix both problems (with or without the patch), but in practice are most sysadmins likely to do that? If they fail to modify it, should sync_members crash? And what if for some reason the system locale changes to, e.g., iso-8859-1? On a site with a UTF-8 encoding, as I understand it, all this functionality does is convert from utf-8 to utf-8. There is a per-list encoding, as might be useful on a non-unicode system hosting lists in both ISO-8859-5 and ISO-8859-1, but as far as I can see, the list encoding is not taken into account in the command-line scripts.

I did wonder if assigning
   enc = locale.getdefaultlocale()[1] or locale.getpreferredencoding() or "UTF8"
within the script would help (outputting to correct encoding for console), but it doesn't; as you say it's the implied decode on the output of formataddr and join that is not seen as a Unicode string. Logically perhaps it should first be decoded from the input encoding and re-encoded as enc, the expected encoding in the system locale; but that's equivalent to doing nothing.
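
A minimal sketch of that decode and re-encode step, assuming the input encoding is known (the function name and parameters are hypothetical, and the code targets Python 3):

```python
import locale

def transcode(raw_bytes, input_encoding, output_encoding=None):
    """Decode from the input encoding, then re-encode for the console."""
    if output_encoding is None:
        # Fall back to the locale's preferred encoding, then UTF-8.
        output_encoding = locale.getpreferredencoding() or 'utf-8'
    text = raw_bytes.decode(input_encoding)
    return text.encode(output_encoding, 'replace')

utf8_name = 'Cédríc'.encode('utf-8')
latin1_name = 'Cédríc'.encode('iso-8859-1')

# Same encoding on both sides: the bytes come back unchanged.
same = transcode(utf8_name, 'utf-8', 'utf-8')
# Different input encoding: the bytes are actually converted.
converted = transcode(latin1_name, 'iso-8859-1', 'utf-8')
```

With UTF-8 on both sides the call is a no-op; it only matters when the input and console encodings differ.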

If the defaultencoding approach were to be implemented in Python in future in a way that doesn't cause this problem (beyond being applied in concatenation and join), then encoding the strings from (for example) an ISO-8859-5 to give legible output on a UTF-8 console would be the way to go. But it doesn't look to me like that is the way the wind is blowing.

Hope this makes sense.

Mark Sapiro (msapiro) wrote : Re: [Bug 1202395] Re: sync_members crashes for UTF-8 real name

On 07/19/2013 05:32 PM, Cedders wrote:
>
> Thanks for the reply. By the way, it was you who suggested this
> approach, and I still think you were right back then!

I know, but that was almost 6 years ago, and there are issues with that
approach.

> Firstly, according to http://wiki.python.org/moin/DefaultEncoding,
> sys.getdefaultencoding() is pretty much deprecated and will be removed
> in Python 3.0 (as you say "Python's default encoding is ascii regardless
> of locale").

True, but this is Mailman 2.1 and Python 2.x and Mailman 2.1 will never
be made compatible with Python 3.

> Secondly, I don't think the input to sync_members should
> be interpreted as a 7-bit message header with possibly RFC 2047
> encoding.

I didn't say it should be. I said that the return from
email.Utils.formataddr() should be 7-bit ascii, but that would make for
an ugly report, particularly if things were RFC 2047 base-64 encoded.

> Finally, yes, modifying site.py as you describe does fix both problems
> (with or without the patch), but in practice are most sysadmins likely
> to do that? If they fail to modify it, should sync_members crash? And
> what if for some reason the system locale changes to, eg iso-8859-1?

If you enable the locale encoding in site.py, it gets the encoding from
locale.getdefaultlocale() so it should be locale-aware. If you go the
sitecustomize.py route, you can use something like this (adapted from
site.py)

import sys
import locale
loc = locale.getdefaultlocale()
if loc[1]:
    sys.setdefaultencoding(loc[1])

> On
> a site with a UTF-8 encoding, as I understand it, all this
> functionality does is convert from utf-8 to utf-8. There is a per-list
> encoding, as might be useful on a non-unicode system hosting lists in
> both ISO-8859-5 and ISO-8859-1, but as far as I can see, the list
> encoding is not taken into account in the command-line scripts.

That's true, but the encoding for the list's language might not be
compatible with the encoding for the console that's running sync_members
or list_members.

> I did wonder if assigning
> enc = locale.getdefaultlocale()[1] or locale.getpreferredencoding() or "UTF8"
> within the script would help (outputting to correct encoding for
> console), but it doesn't; as you say it's the implied decode on the
> output of formataddr and join that is not seen as a Unicode string.
> Logically perhaps it should first be decoded from the input encoding
> and re-encoded as enc, the expected encoding in the system locale; but
> that's equivalent to doing nothing.

It's really a can of worms. Dropping the encode() is probably fine most
of the time, but we really don't know what the encoding is for the input
to sync_members. It could be different from and incompatible with the
default for the locale.

> If the defaultencoding approach were to be implemented in Python in
> future in a way that doesn't cause this problem (beyond being applied in
> concatenation and join), then encoding the strings from (for example) an
> ISO-8859-5 to give legible output on a UTF-8 console would be the way to
> go. But it doesn't look to me like that is the way the wind is blowing.

But how do I know that the inpu...
