Windows: Invalid UTF-8 stream when using Pandoc as a filter #3208

Closed
georgejean opened this issue Nov 3, 2016 · 11 comments

@georgejean

georgejean commented Nov 3, 2016

Hi,

I'm using Pandoc 1.17.2 on Windows 10.

When using pandoc as a filter, with a string containing an accented character as input:

pandoc
# Gérard
^Z

it displays

pandoc.exe: Cannot decode byte '\x82': Data.Text.Internal.Encoding.Fusion.streamUtf8: Invalid UTF-8 stream

I get the expected output on the terminal if I put the text in a UTF-8 encoded file essai.txt and run the following in a PowerShell console:

$outputencoding = [System.Text.Encoding]::UTF8
[Console]::OutputEncoding = [Text.UTF8Encoding]::UTF8
get-content essai.txt | pandoc

I've made some attempts. For example, inserting

$a = [System.Text.Encoding]::UTF8
$outputencoding = $a
[Console]::OutputEncoding = $a
[Console]::InputEncoding = $a

before

pandoc
# Gérard

produces blank output as soon as the first line is finished by pressing Enter.

@jgm
Owner

jgm commented Nov 3, 2016

Can someone else who uses pandoc on Windows try to confirm
this? I don't have a Windows setup handy.

@sergiocorreia

sergiocorreia commented Nov 3, 2016

I can reproduce it, but TBH there are workarounds:

C:\Users\Sergio>echo é | pandoc
pandoc: Cannot decode byte '\x82': Data.Text.Internal.Encoding.Fusion.streamUtf8: Invalid UTF-8 stream

C:\Users\Sergio>chcp 65001

C:\Users\Sergio>echo é | pandoc
<p>é</p>

See this thread for more details.

Now, I wouldn't be surprised if there are bugs in citeproc or other programs that read from stdin, as early on I also had some trouble with this in panflute (workaround). It seems that you need to specify explicitly that you want UTF-8; otherwise some other encoding might be used, which could corrupt the input.
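
To illustrate why the byte in the error message is `\x82`: in the legacy OEM code page 437 (the one `chcp` reports further down in this thread), `é` is the single byte 0x82, whereas UTF-8 encodes it as the two bytes 0xC3 0xA9. The Python sketch below only demonstrates that mismatch; it is not pandoc's own decoding code, and the helper name `try_utf8` is made up for this example.

```python
# Illustration only: how "é" looks as bytes under two encodings.
oem_bytes = "é".encode("cp437")    # legacy console code page
utf8_bytes = "é".encode("utf-8")   # what pandoc expects on stdin


def try_utf8(data: bytes) -> str:
    """Hypothetical helper: decode as UTF-8, or describe the failure."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as err:
        return f"invalid UTF-8 at byte {data[err.start]:#04x}"


print(oem_bytes)                    # b'\x82'
print(utf8_bytes)                   # b'\xc3\xa9'
print(try_utf8(oem_bytes))          # invalid UTF-8 at byte 0x82
print(try_utf8(utf8_bytes) == "é")  # True
```

A lone 0x82 is a UTF-8 continuation byte with no lead byte before it, so any strict UTF-8 decoder has to reject it.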

@georgejean
Author

georgejean commented Nov 4, 2016

Ok, thanks for your responses and for the tip. This is not a big problem.

Just a remark: in a PowerShell console, chcp 65001 is not enough.
It must be replaced by

$outputencoding=[console]::outputencoding=[text.encoding]::utf8

which then allows things like:

"Gérard"|pandoc
<p>Gérard</p>

or

$a, $b, $c = "Manger équilibré", "pâtes", "râpé"
@"
# $a
## Les $b
Faites cuire 8 minutes les $b dans deux volumes d'eau. Servez les $b avec du fromage *$c*   
"@|pandoc
<h1 id="manger-équilibré">Manger équilibré</h1>
<h2 id="les-pâtes">Les pâtes</h2>
<p>Faites cuire 8 minutes les pâtes dans deux volumes d'eau. Servez les pâtes avec du fromage <em>râpé</em></p>

@jgm
Owner

jgm commented Nov 5, 2016

Nothing to change in pandoc, correct?

@georgejean
Author

georgejean commented Nov 6, 2016

I'm not able to answer this question.
In a nutshell:
I can't test "Step 4: Using pandoc as a filter" from the Getting Started guide on Windows, because it doesn't work when strings contain accents.
I haven't found a solution to this problem (I don't know whether there is an encoding configuration that makes it work, so I don't know whether the problem comes from pandoc), but there are easy workarounds.

@jgm
Owner

jgm commented Nov 7, 2016

It seems to me that you have the solution above: if you set the console encoding to UTF-8 using

$outputencoding=[console]::outputencoding=[text.encoding]::utf8

then it works, correct? Pandoc documents that it expects all input, and gives all output, in UTF-8. And it seems that the issue was that your console was not set up by default this way.

But perhaps I've misunderstood, because you say "I've not found a solution."
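
One way to sidestep the console's code page entirely is to hand pandoc bytes that are already UTF-8, for example from a small wrapper script. The Python sketch below assumes `pandoc` is on the PATH; `to_html` is a hypothetical helper name, and `-f markdown -t html` are pandoc's usual format options. It is a workaround sketch under those assumptions, not something proposed in this thread.

```python
import shutil
import subprocess


def to_html(markdown: str) -> str:
    """Hypothetical helper: pipe text to pandoc as UTF-8, return its output."""
    result = subprocess.run(
        ["pandoc", "-f", "markdown", "-t", "html"],
        input=markdown.encode("utf-8"),  # explicit UTF-8, independent of chcp
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8")  # pandoc writes UTF-8 as well


if shutil.which("pandoc"):  # only run when pandoc is actually installed
    html = to_html("# Gérard\n")
    print(html.encode("utf-8"))  # print as bytes to avoid console re-encoding
```

Because the text is encoded before it reaches pandoc's stdin, the active code page never gets a chance to substitute a different byte sequence. This covers the piped case only; text typed interactively still goes through the console's own input encoding.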

@georgejean
Author

In a powershell console,

$outputencoding=[console]::outputencoding=[text.encoding]::utf8
"é"|pandoc

works fine but

$outputencoding=[console]::outputencoding=[text.encoding]::utf8
pandoc
é

outputs (after ctrl+z and Enter)

pandoc.exe: Cannot decode byte '\x82': Data.Text.Internal.Encoding.Fusion.streamUtf8: Invalid UTF-8 stream

In the standard command prompt

chcp 65001
echo é | pandoc

works, but

chcp 65001
pandoc
é

doesn't work.
There must be a way to configure the encoding properly, but I haven't found it yet.

@sergiocorreia

I see your point:

>chcp
Active code page: 437

>chcp 65001
Active code page: 65001

C:\Users\Sergio>pandoc
a
é
<p>a</p>

In this case, pandoc exits silently after reading a non-ASCII character.

PS: These one-liners set cmd and PowerShell to use UTF-8 by default, in case that's useful.

@jgm
Owner

jgm commented Nov 8, 2016

I understand the issue now, but I can't see anything in
pandoc that is wrong. I will have to do some experiments
in Windows to see better what is going on.

@georgejean
Author

Thanks for your replies.
Other experiments:

1

[console]::inputencoding=[text.encoding]::ascii
$outputencoding=[console]::outputencoding=[text.encoding]::utf8
pandoc
a

é

output:

<p>a</p>
<p>e</p>

2 (doesn't work)

$outputencoding=[console]::outputencoding=[console]::inputencoding=[text.encoding]::utf8
pandoc
a
é

output: (without Ctrl-Z)

<p>a</p>

@mb21
Collaborator

mb21 commented Feb 2, 2019

Closing this, since it looks like an issue with the Windows console, not pandoc.
