
dashes in word_count.txt cause errors with WordCount.py #12

Open
HarryCaveMan opened this issue Nov 4, 2018 · 1 comment

Comments

@HarryCaveMan

HarryCaveMan commented Nov 4, 2018

Issue:

The en-dash characters (U+2013) in word_count.txt cause an error when following the "Run your first Spark Job" tutorial. There are only two occurrences of this character: here, "from 1913–74.", and here, "near–bankruptcy".

To Recreate:

Using spark-2.3.2-bin-hadoop2.7 on Ubuntu 18 with pyspark/Python 2.7, installed following the instructions from lecture 5. Go to the directory where you cloned python-spark-tutorial and run the following from lecture 6:

spark-submit ./rdd/WordCount.py

The execution halts about halfway through the frequency counter with the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 4: ordinal not in range(128)

Spoiler: it's the dash. I'm not sure whether the non-ASCII en-dash (U+2013) was intentional, so I'm posting.
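The failure is easy to reproduce outside Spark: the ASCII codec simply cannot represent U+2013, while UTF-8 can. A minimal sketch (shown in Python 3 syntax, but the same codec behavior triggers the error under Python 2 when the job prints its results):

```python
# U+2013 (EN DASH) is outside ASCII's 0-127 range, so encoding fails.
text = u"from 1913\u201374."

try:
    text.encode("ascii")
    raised = False
except UnicodeEncodeError:
    # Same codec error the WordCount job hits when printing.
    raised = True

# The identical text encodes without trouble as UTF-8.
utf8_bytes = text.encode("utf-8")
```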

Work-Around:

I changed the two en-dash characters to "from 1913-74." and "near-bankruptcy", which solved the issue for me. A related Stack Overflow thread describes someone else hitting a similar problem with Python 2.7 and using the same solution.
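The same manual edit can be scripted instead of changing the data file. A sketch, where normalize is a hypothetical helper (not part of WordCount.py) that maps en-dashes to plain ASCII hyphens before the words are counted:

```python
# Normalize U+2013 (EN DASH) to an ASCII hyphen, mirroring the manual edit.
def normalize(line):
    return line.replace(u"\u2013", u"-")

print(normalize(u"from 1913\u201374."))    # from 1913-74.
print(normalize(u"near\u2013bankruptcy"))  # near-bankruptcy
```

In the Spark job this could be applied to the lines RDD, e.g. sc.textFile(...).map(normalize), before splitting into words.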

@kashikhan1

kashikhan1 commented Nov 24, 2018

Just adding this at the top resolves the issue:

# Python 2 only: restore the setdefaultencoding function (hidden at
# interpreter startup, hence the reload) so implicit conversions to
# str use UTF-8 instead of ASCII.
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
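Worth noting that sys.setdefaultencoding is hidden deliberately and mutating the process-wide default is generally discouraged. A sketch of an explicit-encoding alternative, assuming the error fires when the job prints its (word, count) pairs:

```python
# Encode explicitly at the point of output instead of changing the
# interpreter's default encoding.
word, count = u"near\u2013bankruptcy", 1
encoded = (u"%s : %i" % (word, count)).encode("utf-8")
print(encoded)
```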
