
dashes in word_count.txt cause errors with WordCount.py #12

Open
HarryCaveMan opened this issue Nov 4, 2018 · 1 comment

Comments

@HarryCaveMan

HarryCaveMan commented Nov 4, 2018

Issue:

The en-dash characters (U+2013) in word_count.txt cause an error when following the "Run your first Spark Job" tutorial. There are only two occurrences of this character: here, "from 1913–74.", and here, "near–bankruptcy".

To Recreate:

Using spark-2.3.2-bin-hadoop2.7 on Ubuntu 18 with pyspark/Python 2.7, installed following the instructions from lecture 5. Go to the directory where you cloned python-spark-tutorial and run the following from lecture 6:

spark-submit ./rdd/WordCount.py

The execution halts about halfway through the frequency counter with the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 4: ordinal not in range(128)

Spoiler: it's the dash. I'm not sure whether the non-ASCII en-dash (U+2013) was intentional, so I'm posting.
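The failure is easy to reproduce outside Spark: the ASCII codec simply cannot represent U+2013, while UTF-8 can. A minimal sketch (shown in Python 3 syntax, but the same codec behavior triggers the error under Python 2 when the job prints its results):

```python
# U+2013 (EN DASH) is outside ASCII's 0-127 range, so encoding fails.
text = u"from 1913\u201374."

try:
    text.encode("ascii")
    raised = False
except UnicodeEncodeError:
    # Same codec error the WordCount job hits when printing.
    raised = True

# The identical text encodes without trouble as UTF-8.
utf8_bytes = text.encode("utf-8")
```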

Work-Around:

I changed the two en-dash characters to "from 1913-74." and "near-bankruptcy", which solved the issue for me. A related Stack Overflow thread describes someone else hitting a similar problem with Python 2.7 and using the same solution.
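The same manual edit can be scripted instead of changing the data file. A sketch, where normalize is a hypothetical helper (not part of WordCount.py) that maps en-dashes to plain ASCII hyphens before the words are counted:

```python
# Normalize U+2013 (EN DASH) to an ASCII hyphen, mirroring the manual edit.
def normalize(line):
    return line.replace(u"\u2013", u"-")

print(normalize(u"from 1913\u201374."))    # from 1913-74.
print(normalize(u"near\u2013bankruptcy"))  # near-bankruptcy
```

In the Spark job this could be applied to the lines RDD, e.g. sc.textFile(...).map(normalize), before splitting into words.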

@kashikhan1

kashikhan1 commented Nov 24, 2018

Just adding this at the top resolves the issue:

# Python 2 only: restore the setdefaultencoding function (hidden at
# interpreter startup, hence the reload) so implicit conversions to
# str use UTF-8 instead of ASCII.
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
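Worth noting that sys.setdefaultencoding is hidden deliberately and mutating the process-wide default is generally discouraged. A sketch of an explicit-encoding alternative, assuming the error fires when the job prints its (word, count) pairs:

```python
# Encode explicitly at the point of output instead of changing the
# interpreter's default encoding.
word, count = u"near\u2013bankruptcy", 1
encoded = (u"%s : %i" % (word, count)).encode("utf-8")
print(encoded)
```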
