Comment 4 for bug 1202395

Mark Sapiro (msapiro) wrote : Re: [Bug 1202395] Re: sync_members crashes for UTF-8 real name

On 07/19/2013 05:32 PM, Cedders wrote:
>
> Thanks for the reply. By the way, it was you who suggested this
> approach, and I still think you were right back then!

I know, but that was almost 6 years ago, and there are issues with that
approach.

> Firstly, according to http://wiki.python.org/moin/DefaultEncoding,
> sys.getdefaultencoding() is pretty much deprecated and will be removed
> in Python 3.0 (as you say "Python's default encoding is ascii regardless
> of locale").

True, but this is Mailman 2.1 and Python 2.x and Mailman 2.1 will never
be made compatible with Python 3.

> Secondly, I don't think the input to sync_members should
> be interpreted as a 7-bit message header with possibly RFC 2047
> encoding.

I didn't say it should be. I said that the return from
email.Utils.formataddr() should be 7-bit ascii, but that would make for
an ugly report, particularly if things were RFC 2047 base-64 encoded.
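
To illustrate what such a report line could look like, here is a minimal sketch. It assumes Python 3 (not the Python 2.x that Mailman 2.1 runs on), where email.utils.formataddr() accepts a charset argument; the name and address are made up.

```python
from email.header import decode_header
from email.utils import formataddr

# Hypothetical member whose real name contains a non-ASCII character.
name, addr = "José", "jose@example.com"

# Python 3's formataddr() RFC 2047-encodes a non-ASCII real name.
line = formataddr((name, addr), charset="utf-8")
print(line)

# The result is pure 7-bit ASCII (this would raise otherwise),
# but it is not pleasant to read in a report.
line.encode("ascii")

# Decoding the encoded word recovers the original name.
encoded_word = line.rsplit(" ", 1)[0]
raw, charset = decode_header(encoded_word)[0]
recovered = raw.decode(charset)
print(recovered)
```

The printed line is safe for any console, but a human reading the report sees an encoded word rather than the member's name.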

> Finally, yes, modifying site.py as you describe does fix both problems
> (with or without the patch), but in practice are most sysadmins likely
> to do that? If they fail to modify it, should sync_members crash? And
> what if for some reason the system locale changes to, eg iso-8859-1?

If you enable the locale encoding in site.py, it gets the encoding from
locale.getdefaultlocale(), so it should be locale aware. If you go the
sitecustomize.py route, you can use something like this (adapted from
site.py):

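# sitecustomize.py (Python 2 only)
import sys
import locale
# Use the encoding of the default locale, if it names one.
# sys.setdefaultencoding() is still available here because site.py
# deletes it only after importing sitecustomize.
loc = locale.getdefaultlocale()
if loc[1]:
    sys.setdefaultencoding(loc[1])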

> On a site with a UTF-8 encoding, as I understand it, all this
> functionality does is convert from utf-8 to utf-8. There is a per-list
> encoding, as might be useful on a non-unicode system hosting lists in
> both ISO-8859-5 and ISO-8859-1, but as far as I can see, the list
> encoding is not taken into account in the command-line scripts.

That's true, but the encoding for the list's language might not be
compatible with the encoding for the console that's running sync_members
or list_members.
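
A rough sketch of that incompatibility, written for Python 3 purely as an illustration. The helper name encode_for_console is hypothetical and is not part of Mailman; the sample name stands in for a member of a list whose language uses Cyrillic.

```python
def encode_for_console(text, console_encoding):
    """Encode text for output, replacing characters the console cannot show."""
    try:
        return text.encode(console_encoding)
    except UnicodeEncodeError:
        # The console's charset has no representation for some characters.
        return text.encode(console_encoding, "replace")

cyrillic = "Иван"

# A UTF-8 console can represent the name as-is.
ok = encode_for_console(cyrillic, "utf-8")

# An ISO-8859-1 console cannot, so each character degrades to "?".
lossy = encode_for_console(cyrillic, "iso-8859-1")
print(ok, lossy)
```

Falling back to "replace" avoids a crash, at the cost of an unreadable name in the output.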

> I did wonder if assigning
> enc = locale.getdefaultlocale()[1] or locale.getpreferredencoding() or "UTF8"
> within the script would help (outputting to correct encoding for
> console), but it doesn't; as you say it's the implied decode on the
> output of formataddr and join that is not seen as a Unicode string.
> Logically perhaps it should first be decoded from the input encoding
> and re-encoded as enc, the expected encoding in the system locale; but
> that's equivalent to doing nothing.

It's really a can of worms. Dropping the encode() is probably fine most
of the time, but we really don't know what the encoding is for the input
to sync_members. It could be different from and incompatible with the
default for the locale.
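
The ambiguity can be sketched as follows. This is Python 3, so the implicit ASCII decode that Python 2 performs is emulated with an explicit decode("ascii"); the helper try_decode and the sample name are assumptions made for illustration only.

```python
def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# The same name as it might appear in two differently encoded input files.
name = "Иван"
as_utf8 = name.encode("utf-8")
as_8859_5 = name.encode("iso-8859-5")

# An ASCII decode (Python 2's default) fails for both byte strings.
ascii_utf8 = try_decode(as_utf8, "ascii")
ascii_8859_5 = try_decode(as_8859_5, "ascii")

# Guessing the wrong charset either fails outright...
wrong_strict = try_decode(as_8859_5, "utf-8")
# ...or "succeeds" and silently produces the wrong characters.
wrong_silent = try_decode(as_8859_5, "iso-8859-1")

# Only the charset the file was really written in gives the name back.
right = try_decode(as_8859_5, "iso-8859-5")
```

Nothing in the bytes themselves says which of these interpretations is the intended one, which is why the script cannot simply pick a charset for its input.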

> If the defaultencoding approach were to be implemented in Python in
> future in a way that doesn't cause this problem (beyond being applied in
> concatenation and join), then encoding the strings from (for example) an
> ISO-8859-5 to give legible output on a UTF-8 console would be the way to
> go. But it doesn't look to me like that is the way the wind is blowing.

But how do I know that the input to sync_members and hence the output
from email.Utils.formataddr() is iso-8859-5 (or whatever it is) encoded?

I understand that there's an issue and that modifying site.py or adding
something to sitecustomize.py is not a solution that is viable for all.
I'm just reluctant to open this can of worms.

--
Mark Sapiro <email address hidden>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan