want an option to set the output encoding, especially on win32

Bug #340394 reported by Timmie
Affects: Bazaar
Status: Fix Released
Importance: Medium
Assigned to: Martin Pool
Milestone: (none shown)

Bug Description

Sometimes people want to produce output in an encoding that's not their system default, or to produce squashed output from strict commands or vice versa. It would be good if there was an option to do this.

-------------

"bzr log > log.out" on win32 where the commit messages contain non-ascii characters seems to produce invalidly encoded output. It works correctly on Linux, and it produces the correct output in a cmd.exe window.

----------------

When saving the log output to a file, this should be saved in UTF-8 encoded form.

Please also refer to:
https://answers.edge.launchpad.net/bzr/+question/63601

Tags: unicode win32
Revision history for this message
Timmie (timmie) wrote :

Attached you may find a file with an example repository.

Please do:
1) extract
2) bzr log => see changelog with umlauts
3) inspect the file: changelog_test.txt

Revision history for this message
Martin Pool (mbp) wrote :

Discussion in the question

Changed in bzr:
status: New → Incomplete
Revision history for this message
Timmie (timmie) wrote :

Hello,
I assigned this bug to Martin Pool and changed the status to confirmed:
See the latest comment proving that the wrong encoding is used under windows:
https://answers.edge.launchpad.net/bzr/+question/63601

It would be nice if this could be fixed in the next version of BZR.

Changed in bzr:
status: Incomplete → Confirmed
assignee: nobody → mbp
tags: added: log
Martin Pool (mbp)
Changed in bzr:
assignee: mbp → nobody
importance: Undecided → Medium
summary: - log output should be saved in user defined encoding
+ redirecting log output to a file produces the wrong encoding on win32
description: updated
tags: added: win32
Revision history for this message
Martin Pool (mbp) wrote : Re: redirecting log output to a file produces the wrong encoding on win32

Tim, could you please attach the file produced by redirecting the output on Windows, using your test repository? Can you determine which encoding it is in?

Revision history for this message
Timmie (timmie) wrote :

> please attach the file produced by redirecting the output on Windows, using your test repository?
Please see attached.
> Can you determine which encoding it is in?
No. I cannot. I tried different ones but none seems fitting.

Revision history for this message
Martin Pool (mbp) wrote :

Thanks Tim.

It looks like that's in IBM-850 (aka cp850), with ü as codepoint 0x81 and ö as cp 0x94. It may be that cp850 is the right thing to write to the terminal but if you're in a utf-8 locale and redirected to a file it should probably write utf-8.

One more thing, what happens if you do "type changelog_test.txt" in the terminal?
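
The cp850 diagnosis above can be checked with a few lines of Python. This is only a minimal sketch in Python 3; the sample bytes are made up for illustration and are not taken from the attached file. It shows that the byte values 0x81 and 0x94 decode to ü and ö under cp850, and what the same text looks like once re-encoded as UTF-8.

```python
# Hypothetical sample: bytes as they might appear in a cp850-encoded log file.
raw = b"ge\x81ndert, gel\x94scht"

# Interpret the bytes as cp850: 0x81 is U+00FC (ü), 0x94 is U+00F6 (ö).
text = raw.decode("cp850")

# Re-encode the decoded text as UTF-8 for a file that other tools can read.
utf8 = text.encode("utf-8")

# Print escaped forms so the output does not depend on the console encoding.
print(ascii(text))
print(utf8)
```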

Martin Pool (mbp)
summary: - redirecting log output to a file produces the wrong encoding on win32
+ redirecting log output to a file produces cp850 on win32
Revision history for this message
Timmie (timmie) wrote : Re: redirecting log output to a file produces cp850 on win32

> One more thing, what happens if you do "type changelog_test.txt" in the terminal?
The contents are displayed correctly.

Revision history for this message
Wouter van Heyst (larstiq) wrote : Re: [Bug 340394] Re: redirecting log output to a file produces the wrong encoding on win32

On Thu, Apr 09, 2009 at 01:39:12AM -0000, Martin Pool wrote:
> Thanks Tim.
>
> It looks like that's in IBM-850 (aka cp850), with ü as codepoint 0x81
> and ö as cp 0x94. It may be that cp850 is the right thing to write to
> the terminal but if you're in a utf-8 locale and redirected to a file it
> should probably write utf-8.

Drive-by-bug-commenting: aren't the filesystem encoding and the terminal
encoding separated in Windows? I recall bialix having trouble in that
area in the past.

Wouter van Heyst

Revision history for this message
Alexander Belchenko (bialix) wrote : Re: redirecting log output to a file produces cp850 on win32

Filesystem encoding on windows is always mbcs (roughly the same as UTF-16). Terminal encoding is OEM by default, so cp850 is Latin-1 OEM encoding. It's correct. Corresponding ANSI encoding (used by GUI) will be cp1252 (Latin-1 ANSI encoding). OS Windows has no explicit support for utf-8 encoding.

The lists of supported ANSI and OEM encodings:
http://msdn.microsoft.com/ru-ru/goglobal/bb964654(en-us).aspx
http://msdn.microsoft.com/ru-ru/goglobal/bb964655(en-us).aspx

Revision history for this message
Martin Pool (mbp) wrote :

2009/4/21 Alexander Belchenko <email address hidden>:
> Martin Pool wrote:
>>
>> https://launchpad.net/bugs/340394
>>
>> Does anyone have a clear idea of
>> 1- what output encoding bzr should use on Windows when output is
>> redirected to a file and running in a terminal using cp850?
>
> What do you mean by "should"?
>
> The current implementation uses the terminal encoding, which in this case is
> cp850. If no terminal is available, the default user encoding is used,
> which in this case is cp1252.
>
> Here is the list of supported encodings (ANSI and OEM):
> http://msdn.microsoft.com/ru-ru/goglobal/bb964654(en-us).aspx
> http://msdn.microsoft.com/ru-ru/goglobal/bb964655(en-us).aspx
>
> As you can see, there is no UTF-8 encoding in any form, so the user is unable to
> switch the terminal encoding to UTF-8, only to cp1252, because it is the best
> match for cp850 (cp850 is the OEM encoding, cp1252 the ANSI encoding).
>
> To switch the encoding of the terminal, the user can use the chcp command.
> When this command is invoked without arguments, it prints the current code page.
>
> E.g. on my machine with Russian settings:
>
> C:\>chcp
> Active code page: 866
>
> C:\>chcp 1251
>
> C:\>chcp
> Active code page: 1251
>
>> 2- what mechanism if any should be available to control it?
>>
>> I suppose we could (like svn?) have a parameter that specifies the
>> output encoding...
>
> Yes, PLEASE. I think it should be global option, available to all commands.
> E.g.
>
> bzr --encoding=utf-8 log

OK, so I've changed bug 340394 to be about that, as it seems that we can't do any better at present by just using the encoding of the environment.

--
Martin <http://launchpad.net/~mbp/>
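
The fallback order described in the quoted message (terminal encoding first, then the default user encoding), together with the requested override, could be sketched roughly as follows. This is an illustration in Python 3, not bzr's actual code: `pick_output_encoding` and its `override` parameter are hypothetical names. Note that under Python 2, `sys.stdout.encoding` is `None` when output is redirected to a file, which is the case the fallback covers; Python 3 normally sets it even then.

```python
import locale
import sys


def pick_output_encoding(stream=None, override=None):
    """Hypothetical helper: choose an encoding for text written to `stream`."""
    # 1. An explicit user choice (e.g. from an option) always wins.
    if override:
        return override
    # 2. Otherwise use the encoding the stream reports, if it reports one.
    if stream is None:
        stream = sys.stdout
    encoding = getattr(stream, "encoding", None)
    if encoding:
        return encoding
    # 3. Fall back to the user's default (locale) encoding.
    return locale.getpreferredencoding(False)
```

For example, `pick_output_encoding(override="utf-8")` returns `"utf-8"` regardless of the console code page.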

summary: - redirecting log output to a file produces cp850 on win32
+ want an option to set the output encoding, especially on win32
Martin Pool (mbp)
description: updated
tags: added: unicode
removed: log
Revision history for this message
Timmie (timmie) wrote :

Coming back on this:
How does the output of bzr log > mylog.txt need to be converted to be usable with non-English characters?

Shall we write out to cp1252?
How would I read the text file back in with no worries?

Revision history for this message
Alexander Belchenko (bialix) wrote :

Tim, you can change terminal encoding (for current session) with chcp command, e.g.:

chcp 1252

After that `bzr log > mylog.txt` will write file in cp1252 encoding.
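
If the file really is written in cp1252 as described, reading it back and converting it is straightforward once the source encoding is known. The following is a minimal Python 3 sketch; `transcode` is a hypothetical helper and the file names in the usage note are placeholders.

```python
import io


def transcode(src_path, dst_path, src_encoding="cp1252", dst_encoding="utf-8"):
    """Hypothetical helper: rewrite a text file in a different encoding."""
    # newline="" keeps the original line endings untouched.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with io.open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)
```

Usage would be along the lines of `transcode("mylog.txt", "mylog.utf8.txt")`. If the log was written without running chcp first, the source encoding would need to be the OEM code page instead (cp850 in this report).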

Revision history for this message
Timmie (timmie) wrote :

I may add an observation:
When I inspect the revision log in the qbzr (Windows Explorer Extension) I can see the Umlauts correctly.
Why can this correct revision log not be saved into a text file?
If anyone has an idea on how to export/include the bzr log into a Sphinx-based documentation without saving the log to a file, please pass it to me.

Revision history for this message
Timmie (timmie) wrote :

referring to #12

As noted in the initial Q&A, I aim to retrieve the log by a script (not interactively in the Windows console).
Also, this approach is not cross-platform. e.g. when my sphinx project is run on linux, this cp1252 is not useful.

> chcp = subprocess.Popen(('chcp', '1252'), bufsize=-1)
---------------------------------------------------------------------------
WindowsError Traceback (most recent call last)

> in <module>()

C:\Programme\pythonxy\python\lib\subprocess.pyc in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags)
    592 p2cread, p2cwrite,
    593 c2pread, c2pwrite,
--> 594 errread, errwrite)
    595
    596 # On Windows, you cannot just redirect one or two handles: You

C:\Programme\pythonxy\python\lib\subprocess.pyc in _execute_child(self, args, executable, preexec_fn, close_fds, cwd, env, universal_newlines, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite)
    814 env,
    815 cwd,
--> 816 startupinfo)
    817 except pywintypes.error, e:
    818 # Translate pywintypes.error to WindowsError, which is

WindowsError: [Error 2] Das System kann die angegebene Datei nicht finden [The system cannot find the file specified]

Why can the content of .bzr\checkout\dirstate not be converted in a simple text file that is either UTF-8 or latin1?

Revision history for this message
Alexander Belchenko (bialix) wrote : Re: [Bug 340394] Re: want an option to set the output encoding, especially on win32

Tim wrote:
> referring to #12
>
> As noted in the initial Q&A, I aim to retrieve the log by a script (not interactively in the Windows console).
> Also, this approach is not cross-platform. e.g. when my sphinx project is run on linux, this cp1252 is not useful.

If you do it from a Python script then it's much simpler for you to
import bzrlib and get the log as a unicode or utf-8 encoded string.

Revision history for this message
Timmie (timmie) wrote :

http://bazaar-vcs.org/Integrating_with_Bazaar got me started.

import bzrlib
from bzrlib import log
from bzrlib.branch import Branch
b = Branch.open('./myscripts/')
import sys
lf = log.LongLogFormatter(to_file=sys.stdout)
log.show_log(b, lf)

But it doesn't seem that simple:

---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)

~/workspace/<ipython console> in <module>()

/usr/lib/python2.6/dist-packages/bzrlib/log.pyc in show_log(branch, lf, specific_fileid, verbose, direction, start_revision, end_revision, search, limit, show_diff)
    207 limit=limit, message_search=search,
    208 delta_type=delta_type, diff_type=diff_type)
--> 209 Logger(branch, rqst).show(lf)
    210
    211

/usr/lib/python2.6/dist-packages/bzrlib/log.pyc in show(self, lf)
    328 if getattr(lf, 'begin_log', None):
    329 lf.begin_log()
--> 330 self._show_body(lf)
    331 if getattr(lf, 'end_log', None):
    332 lf.end_log()

/usr/lib/python2.6/dist-packages/bzrlib/log.pyc in _show_body(self, lf)
    353 generator = self._generator_factory(self.branch, rqst)
    354 for lr in generator.iter_log_revisions():
--> 355 lf.log_revision(lr)
    356 lf.show_advice()
    357

/usr/lib/python2.6/dist-packages/bzrlib/log.pyc in log_revision(self, revision)
   1457 message = revision.rev.message.rstrip('\r\n')
   1458 for l in message.split('\n'):
-> 1459 to_file.write(indent + ' %s\n' % (l,))
   1460 if revision.delta is not None:
   1461 # We don't respect delta_format for compatibility

UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 9: ordinal not in range(128)

What is wrong?
Please give some advice.

Thanks in advance.

Revision history for this message
Alexander Belchenko (bialix) wrote :

Tim wrote:
> http://bazaar-vcs.org/Integrating_with_Bazaar got me started.
>
> import bzrlib
> from bzrlib import log
> from bzrlib.branch import Branch
> b = Branch.open('./myscripts/')
> import sys
> lf = log.LongLogFormatter(to_file=sys.stdout)
> log.show_log(b, lf)
>
> But it doesn't seem that simple:
...
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 9: ordinal not in range(128)
>
> What is wrong?
> Please give some advice.

sys.stdout is not what you want. You want to avoid creating a temp file?
Then it's better to dump the log into memory. You can't dump pure unicode to
stdout, can you? What about utf-8 instead? OK?

So let's use StringIO from the Python standard library.

import codecs
from cStringIO import StringIO
import sys

import bzrlib
from bzrlib import log
from bzrlib.branch import Branch

# prepare utf-8 encoded stream
sio = codecs.getwriter('utf-8')(StringIO())

# open branch
b = Branch.open('./myscripts/')

# print log to our stream
lf = log.LongLogFormatter(to_file=sio)
log.show_log(b, lf)

# get the log as utf-8 string
s = sio.getvalue()

# and then we can close our stream
sio.close()

Then you can use utf-8 data in s as you wish. You can print it to stdout:

sys.stdout.write(s)

or do something else.

HTH
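
For readers on Python 3, where cStringIO no longer exists, the same technique (wrapping an in-memory byte buffer in a UTF-8 stream writer) can be sketched without bzrlib. This is an illustrative sketch only: `render_to_utf8` is a hypothetical helper, and the callback stands in for whatever writes the log to the stream.

```python
import codecs
import io


def render_to_utf8(write_log):
    """Hypothetical helper: collect text written by `write_log` as UTF-8 bytes."""
    buf = io.BytesIO()
    # The writer accepts text (str) and stores UTF-8 encoded bytes in buf.
    writer = codecs.getwriter("utf-8")(buf)
    write_log(writer)
    return buf.getvalue()


def sample_log(to_file):
    # Stand-in for a log formatter writing a message with a non-ASCII character.
    to_file.write("message: ge\u00e4ndert\n")


data = render_to_utf8(sample_log)
print(data)
```

Because the encoding is fixed by the writer rather than taken from the console, the result does not depend on the platform or the active code page.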

Revision history for this message
Timmie (timmie) wrote :

Side note: I discovered another bug through this:
https://bugs.launchpad.net/bzr/+bug/416373

Revision history for this message
Timmie (timmie) wrote :

Hi Alexander,
I checked your hints.
The result ("s") is what I was aiming at.
Thanks a lot for this workaround.

I still do not consider this bug solved:
1) next to the standalone windows installer I have to take care now that bzrlib is available
2) I consider
p_log = subprocess.Popen(('bzr log --short'),
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE, bufsize=-1)
as faster.
3) why these encoding problems?
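
Regarding point 2, when the output of a command is captured through a pipe, the script receives raw bytes and has to decode them with whatever encoding the command actually used. A minimal Python 3 sketch follows; `run_and_decode` is a hypothetical helper, and which encoding bzr uses in this situation is an assumption based on the earlier comments in this thread (the default user encoding, e.g. cp1252, when no terminal is attached).

```python
import subprocess


def run_and_decode(args, encoding):
    """Hypothetical helper: run a command and decode its captured output."""
    proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    # stdout is decoded strictly so that a wrong guess fails loudly;
    # stderr is decoded leniently because it is only diagnostic.
    return out.decode(encoding), err.decode(encoding, "replace"), proc.returncode
```

A call might look like `run_and_decode(["bzr", "log", "--short"], "cp1252")`; the encoding argument is a guess that has to match what the command really wrote.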

As a suggestion:
Incorporate your code in the "bzr log" command:
bzr log --export-encoding "utf-8" --file-name "mylog.txt"
and everyone is happy and this bug could be closed.

What do you think?

Thanks again for your help.

Revision history for this message
Alexander Belchenko (bialix) wrote :

Tim wrote:
> Hi Alexander,
> I checked your hints.
> The result ("s") is what I was aiming at.
> Thanks a lot for this workaround.
>
> I still do not consider this bug solved:
> 1) next to the standalone windows installer I have to take care now that bzrlib is available

You can use installed bzr.exe and its bzrlib if you're using Python 2.5.
I can teach you if you wish.
But AFAICS you're using Python 2.6, so it won't work.

> 2) I consider
> p_log = subprocess.Popen(('bzr log --short'),
> stdout=subprocess.PIPE, stderr=subprocess.PIPE, bufsize=-1)
> as faster.

I don't quite understand this.

> 3) why these encoding problems?

Because bzr:

a) written mostly by ASCII-only people who don't see all these unicode
problems every day?
b) written by Linux developers who have utf-8 as their standard encoding
everywhere and therefore always have utf-8 output and don't encounter
UnicodeErrors?

> As a suggestion:
> Incorporate your code in the "bzr log" command:
> bzr log --export-encoding "utf-8" --file-name "mylog.txt"

My code is far from this.

> and everyone is happy and this bug could be closed.

I can help you write a plugin to do this. It will be much simpler, better
and easier to achieve. I don't do bzr development on a regular basis, and
I doubt anyone else will implement this. If you wish, I can guide you
in writing such a patch for bzr. But writing a plugin is really simpler.

Martin Pool (mbp)
Changed in bzr:
assignee: nobody → Martin Pool (mbp)
status: Confirmed → In Progress
Revision history for this message
John A Meinel (jameinel) wrote :

We can now set the output via a configuration option. Though I guess we also want another bug about not being able to set this on the command line?

Changed in bzr:
milestone: none → 2.2.0
status: In Progress → Fix Released
Revision history for this message
Martin Pool (mbp) wrote : Re: [Bug 340394] Re: want an option to set the output encoding, especially on win32

On 7 August 2010 05:01, John A Meinel <email address hidden> wrote:
> We can now set the output via a configuration option. Though I guess we
> also want another bug about not being able to set this on the command
> line?

Yes, that'd be covered by bug 491196.
--
Martin

Revision history for this message
Timmie (timmie) wrote :

Great that you fixed this! Thanks.
