Doesn't work with cyrillic texts #33

Open
egalion opened this issue Aug 26, 2014 · 1 comment · May be fixed by #36

egalion commented Aug 26, 2014

The current version doesn't work with Cyrillic text. It raises a Unicode error.

More specifically:

  • with markdown
Unexpected Error:  <type 'exceptions.UnicodeDecodeError'>
Traceback (most recent call last):
  File "criticParser_CLI.py", line 348, in <module>
    h = markdown.markdown(h, extensions=['extra', 'codehilite', 'meta'])
  File "/usr/lib/python2.7/dist-packages/markdown/__init__.py", line 396, in markdown
    return md.convert(text)
  File "/usr/lib/python2.7/dist-packages/markdown/__init__.py", line 266, in convert
    source = unicode(source)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128). -- Note: Markdown only accepts unicode input!
  • with markdown2
Using the Markdown2 module for processing
/path-to-program/CriticMarkup-toolkit/CLI/1.html
Unexpected Error:  <type 'exceptions.UnicodeEncodeError'>
Traceback (most recent call last):
  File "criticParser_CLI.py", line 371, in <module>
    filesource.write(h)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3667-3670: ordinal not in range(128)
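
Both tracebacks look like the same underlying problem: a byte string is being converted implicitly with the 'ascii' codec. A minimal sketch of that failure, assuming the input file is UTF-8 (the sample bytes below are an illustration, not taken from the tool; they spell the Russian word "Привет"):

```python
# UTF-8 bytes for a six-letter Cyrillic word (illustrative sample data).
# The first byte is 0xd0, the same byte reported in the first traceback.
raw = b"\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82"

# Decoding with the ASCII codec fails on the very first byte.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding with the codec the file was actually written in succeeds.
text = raw.decode("utf-8")
```

Here `ascii_error.start` is 0, which matches "position 0" in the first traceback, while `text` holds six characters decoded from twelve bytes.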

I found a workaround after some googling. It may not be very elegant, but it does the job. It applies to the command-line tool criticParser_CLI.py. I am not a programmer, so there may be a better way to do it.

First, this section

#!/usr/bin/env python

import codecs
import sys
import os
import re
import argparse
import subprocess

should become

#!/usr/bin/env python

import codecs
import sys

reload(sys)
sys.setdefaultencoding('utf8')

import os
import re
import argparse
import subprocess

Then this section

jq = '''<!DOCTYPE html>
<html>
<head><script src="http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<title>Critic Markup Output</title>'''

head = '''<!DOCTYPE html>
<html>
<head>
<title>Critic Markup Output</title>'''

should become

jq = '''<!DOCTYPE html>
<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<title>Critic Markup Output</title>'''

head = '''<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Critic Markup Output</title>'''

teoric commented Jul 7, 2015

This is not only a problem with Cyrillic text: it affects any text containing non-ASCII characters, which is nearly everything other than plain English or classical Latin. It would be enough to replace open(args.source, "r") with codecs.open(args.source, "r", encoding="UTF-8"), or even to add an encoding parameter. This is a little less hacky than sys.setdefaultencoding('utf8').
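
A minimal sketch of that suggestion, assuming UTF-8 input. The helper names `read_text` and `write_text` and the temporary sample file are hypothetical and not part of criticParser_CLI.py. Since the markdown2 traceback above fails in `filesource.write(h)`, the output file presumably needs the same treatment as the input, so the sketch covers both directions:

```python
import codecs
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Return the file's contents as decoded text rather than raw bytes."""
    with codecs.open(path, "r", encoding=encoding) as f:
        return f.read()


def write_text(path, text, encoding="utf-8"):
    """Encode text explicitly on write instead of relying on the ASCII default."""
    with codecs.open(path, "w", encoding=encoding) as f:
        f.write(text)


# Illustrative sample: a Cyrillic sentence with a CriticMarkup addition.
# Unicode escapes keep this source file itself pure ASCII.
sample = u"\u041f\u0440\u0438\u0432\u0435\u0442 {++\u043c\u0438\u0440++}"

tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "sample.md")

write_text(src, sample)
round_tripped = read_text(src)
```

With explicit encodings on both ends, `round_tripped` equals `sample`, and nothing depends on the interpreter-wide default that `sys.setdefaultencoding` changes.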

teoric added a commit to teoric/CriticMarkup-toolkit that referenced this issue Jul 7, 2015
addresses CriticMarkup#33

Just using `open` tends to expect ASCII encoding. UTF-8 seems to be a
more sensible default.

Maybe in the long run, an encoding parameter makes sense?
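
One way such an encoding parameter might look, as a sketch only. The `--encoding` flag and the `build_parser` helper are hypothetical and do not exist in the current tool; the `source` argument is assumed from the `args.source` reference above:

```python
import argparse


def build_parser():
    """Build a parser with a positional source and an optional encoding."""
    parser = argparse.ArgumentParser(
        description="Sketch of a CLI that accepts an input encoding")
    parser.add_argument("source", help="path to the input file")
    # Hypothetical option: defaults to UTF-8, can be overridden per run.
    parser.add_argument("--encoding", default="utf-8",
                        help="text encoding of the source file")
    return parser


# Argument lists are passed explicitly so the sketch runs without a shell.
args = build_parser().parse_args(["notes.md"])
args_override = build_parser().parse_args(["notes.md", "--encoding", "cp1251"])
```

The parsed value could then be passed along as `codecs.open(args.source, "r", encoding=args.encoding)`, keeping UTF-8 as the default while still allowing legacy encodings such as cp1251.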
teoric linked a pull request Aug 26, 2015 that will close this issue