utf8 fixes to csv -> hive upload #4488

Merged
merged 1 commit into from
Feb 28, 2018

Conversation

timifasubaa
Contributor

@timifasubaa timifasubaa commented Feb 27, 2018

This PR fixes 4 issues.

  1. UTF-8 encoded uploads often begin with a byte order mark (BOM), a few leading bytes that identify the encoding, and this makes Hive fail. This PR strips those bytes before passing the file to Hive.
  2. When users upload UTF-8 files, building the part of the query that lists the column names and types raised UTF-8 encoding errors.
  3. A newer version of Hive provides a better way to skip the first line of the CSV file, which contains the column names.
  4. I erroneously left airbnb-superset as the default bucket name. It is now the value specified in the config file.
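For readers unfamiliar with item 1, here is a minimal sketch of stripping a UTF-8 BOM from raw file bytes. It is not the code in this PR; the helper name `strip_utf8_bom` and the sample data are assumptions made for illustration, and it is written for Python 3.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if one is present.

    Hypothetical helper for illustration; not taken from the PR.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Sample CSV bytes as an editor that writes a BOM might save them.
raw = codecs.BOM_UTF8 + "name,città\n".encode("utf-8")
cleaned = strip_utf8_bom(raw)
print(cleaned.decode("utf-8").splitlines()[0])
```

Only a leading BOM is removed; the rest of the content is left byte-for-byte unchanged.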

How was this tested?
Tested on my development machine.

@john-bodley @mistercrunch

@timifasubaa timifasubaa changed the title from "fixes to csv - hive upload" to "utf8 fixes to csv - hive upload" Feb 27, 2018
@timifasubaa timifasubaa changed the title from "utf8 fixes to csv - hive upload" to "utf8 fixes to csv -> hive upload" Feb 27, 2018
@john-bodley
Member

I think simply using the unicodecsv package rather than csv would remedy this issue.
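To illustrate the idea behind this suggestion, the sketch below reads the header row as text rather than bytes. It is an assumption-laden stand-in: unicodecsv is a third-party package aimed at Python 2, so this version uses Python 3's standard `csv` module instead, and the body of `get_column_names` is a guess at the shape of the function, not the PR's implementation.

```python
import csv
import os
import tempfile


def get_column_names(filepath):
    """Return the header row of a CSV file as a list of text strings.

    Hypothetical body; only the function name comes from the diff.
    """
    # 'utf-8-sig' decodes UTF-8 and drops a leading BOM if one is present.
    with open(filepath, newline="", encoding="utf-8-sig") as f:
        return next(csv.reader(f))


# Write a small BOM-prefixed CSV with a non-ASCII column name, then read it.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"\xef\xbb\xbf" + "id,prénom\n1,Zoë\n".encode("utf-8"))
path = tmp.name
columns = get_column_names(path)
os.remove(path)
print(columns)
```

Because the column names come back as text, no per-column `decode` call is needed when they are later joined into the schema definition.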

@timifasubaa timifasubaa force-pushed the upload_csv_fixes branch 2 times, most recently from c02fbd7 to 5f991db Compare February 27, 2018 01:28
@timifasubaa
Contributor Author

@john-bodley Done. Good call.

@timifasubaa timifasubaa force-pushed the upload_csv_fixes branch 3 times, most recently from 8dc9d42 to 42f4ffa Compare February 27, 2018 01:51
@@ -868,16 +868,17 @@ def get_column_names(filepath):
 secure_filename(form.csv_file.data.filename)
 column_names = get_column_names(upload_path)
 schema_definition = ', '.join(
-[s + ' STRING ' for s in column_names])
+[s.decode('utf-8') + ' STRING ' for s in column_names])
Member

Why do we need to use decode here?

Contributor Author

You were right. It is no longer needed after using unicodecsv. Thanks!


 s3 = boto3.client('s3')
 location = os.path.join('s3a://', bucket_path, upload_prefix, table_name)
 s3.upload_file(
-upload_path, 'airbnb-superset',
+upload_path, bucket_path,
Member

Nice catch.
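A rough sketch of the bucket fix, assuming the bucket name is passed in from configuration rather than hard-coded. The wrapper `upload_csv`, the `RecordingClient` stand-in, and all sample values are hypothetical; a real caller would pass a `boto3.client('s3')` object, whose `upload_file(Filename, Bucket, Key)` method is called with the same argument order.

```python
import posixpath


def upload_csv(s3_client, upload_path, bucket, upload_prefix, table_name, filename):
    """Upload a local file to the configured bucket and return the table location.

    Hypothetical wrapper for illustration; not the PR's code.
    """
    key = posixpath.join(upload_prefix, table_name, filename)
    # The bucket comes from the caller (for example, the config file),
    # not from a hard-coded default.
    s3_client.upload_file(upload_path, bucket, key)
    return "s3a://" + posixpath.join(bucket, upload_prefix, table_name)


class RecordingClient:
    """Stand-in for an S3 client that records calls instead of uploading."""

    def __init__(self):
        self.calls = []

    def upload_file(self, filename, bucket, key):
        self.calls.append((filename, bucket, key))


client = RecordingClient()
location = upload_csv(
    client, "/tmp/people.csv", "my-config-bucket", "uploads", "people", "people.csv"
)
print(location)
```

Passing the client in also makes the upload path easy to exercise in tests without network access.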

@mistercrunch
Member

LGTM

@graceguo-supercat graceguo-supercat merged commit 404e2d5 into apache:master Feb 28, 2018
michellethomas pushed a commit to michellethomas/panoramix that referenced this pull request May 24, 2018
wenchma pushed a commit to wenchma/incubator-superset that referenced this pull request Nov 16, 2018
 os.path.join(upload_prefix, table_name, filename))
 sql = """CREATE EXTERNAL TABLE {table_name} ( {schema_definition} )
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS
-TEXTFILE LOCATION '{location}'""".format(**locals())
+TEXTFILE LOCATION '{location}'
+tblproperties ('skip.header.line.count'='1')""".format(**locals())
Member

Note that the skip.header.line.count table property only works in Presto v0.199 or later.
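For reference, a minimal sketch of how the statement in this hunk could be assembled with explicit arguments instead of `format(**locals())`. The function name `build_create_table_sql` and the sample values are assumptions; the SQL text mirrors the diff. Identifiers are not quoted or escaped here, so this is only suitable for trusted, already-validated names.

```python
def build_create_table_sql(table_name, column_names, location):
    """Build a CREATE EXTERNAL TABLE statement that skips the CSV header row.

    Hypothetical helper for illustration; every column is typed as STRING,
    as in the diff.
    """
    schema_definition = ", ".join(name + " STRING" for name in column_names)
    return (
        "CREATE EXTERNAL TABLE {table_name} ( {schema_definition} )\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS\n"
        "TEXTFILE LOCATION '{location}'\n"
        "tblproperties ('skip.header.line.count'='1')"
    ).format(
        table_name=table_name,
        schema_definition=schema_definition,
        location=location,
    )


sql = build_create_table_sql(
    "people", ["id", "prénom"], "s3a://my-config-bucket/uploads/people"
)
print(sql)
```

Naming the format arguments explicitly avoids depending on whichever local variables happen to be in scope.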

@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.24.0 labels Feb 27, 2024