
BigQuery: DATETIME columns invalid when uploaded with load_table_from_dataframe #9996

Closed
tswast opened this issue Dec 18, 2019 · 19 comments · Fixed by #10028
Labels
api: bigquery (Issues related to the BigQuery API) · backend · priority: p2 (Moderately-important priority; fix may not be included in next release) · type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns)

Comments

tswast commented Dec 18, 2019

https://github.com/googleapis/google-cloud-python/issues/9920#issuecomment-566642042

tswast added the api: bigquery, priority: p1 (important, blocks the next release), and type: bug labels on Dec 18, 2019
tswast commented Dec 18, 2019

See the pyarrow type currently used for DATETIME columns:

return pyarrow.timestamp("us", tz=None)

The fix is likely to change the pyarrow type to ns to match pandas behavior. We'll definitely want a system test to verify the fix.
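For illustration, a minimal sketch (not library code) comparing the current microsecond Arrow type with the proposed nanosecond one for a timezone-naive pandas datetime64[ns] column:

# Sketch only: compare the current "us" Arrow type with the proposed "ns" type
# for a timezone-naive pandas datetime64[ns] column.
import pandas as pd
import pyarrow

series = pd.Series(pd.to_datetime(["2019-12-18 09:30:00.123456"]))

us_array = pyarrow.Array.from_pandas(series, type=pyarrow.timestamp("us", tz=None))
ns_array = pyarrow.Array.from_pandas(series, type=pyarrow.timestamp("ns", tz=None))

print(us_array.type)  # timestamp[us]
print(ns_array.type)  # timestamp[ns] -- matches the pandas dtype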

tswast commented Dec 18, 2019

One thing to consider: the full range of DATETIME is not supported when nanosecond precision is used.

We need to check the input precision and make sure it matches the Arrow precision we use.
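For reference, nanosecond precision limits the representable range well below what BigQuery DATETIME allows (years 1 through 9999):

# The bounds of nanosecond-precision timestamps in pandas; DATETIME values
# outside this window cannot be represented as datetime64[ns].
import pandas as pd

print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807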

emkornfield commented Dec 20, 2019

@tswast I think the short term fix would be to change datetime64 to TIMESTAMP instead of DATETIME
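A hypothetical sketch of that change, assuming the dtype-to-BigQuery-type mapping is a dict along the lines of _PANDAS_DTYPE_TO_BQ in _pandas_helpers.py (the name and contents here are assumptions, not verbatim library code):

# Hypothetical sketch of the short-term fix; not verbatim library code.
_PANDAS_DTYPE_TO_BQ = {
    "datetime64[ns, UTC]": "TIMESTAMP",
    "datetime64[ns]": "TIMESTAMP",  # previously mapped to "DATETIME"
    # ... other dtype mappings unchanged ...
}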

HemangChothani (Contributor) commented:

@tseaver pyarrow.timestamp("ns", tz=None) doesn't resolve the issue. I've also tried with manually generated files but couldn't find a resolution. I'm not sure, but there may be something related to the backend as well?

emkornfield (Contributor) commented:

@HemangChothani Can you try changing the datetime64 mapping to be TIMESTAMP?

HemangChothani commented Dec 20, 2019 via email

emkornfield (Contributor) commented:

I'll let others chime in, but in the short term I think this might be a reasonable fix. The main downside is that it doesn't allow round-tripping "datetime64[ns]", but at least data is preserved.

tswast commented Dec 22, 2019

If you manually specify TIMESTAMP in the job_config, does that work?

I’d prefer to keep timezone-naive columns mapping to DATETIME and fix the conversion.

HemangChothani (Contributor) commented:

@tswast Yes, it works if I manually specify TIMESTAMP in the job_config.
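For anyone else hitting this, a minimal sketch of that workaround (assumes default credentials; the table and column names are illustrative and match the example schema used later in this thread):

# Sketch of the workaround: declare the timezone-naive column as TIMESTAMP in
# the load job schema instead of relying on the automatic DATETIME mapping.
import pandas as pd
from google.cloud import bigquery

dataframe = pd.DataFrame({"dvd_release": pd.to_datetime(["2002-01-22 07:00:00"])})

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField("dvd_release", "TIMESTAMP")]
)
load_job = client.load_table_from_dataframe(
    dataframe, "my_dataset.monty_python", job_config=job_config
)
load_job.result()  # wait for the load to finish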

tswast commented Dec 23, 2019

OK, thanks. Since we have a workaround, I'll drop the priority to P2.

tswast added the priority: p2 (moderately important) label and removed the priority: p1 (important, blocks the next release) label on Dec 23, 2019
tswast commented Dec 30, 2019

For system tests of this fix, I'd like to see supported types added to https://github.com/googleapis/google-cloud-python/blob/master/bigquery/samples/load_table_dataframe.py and the tests for that sample updated to read and verify the table data.

I'll work on updating the tests for currently-supported data types.

tswast commented Dec 30, 2019

Created draft PR #10028 to hopefully reproduce this issue in a system test.

tswast commented Jan 2, 2020

I see in the unit tests for this that uploading DATETIME columns was blocked by https://issues.apache.org/jira/browse/ARROW-5450, so once that Arrow issue is fixed, we'll require the latest version of pyarrow to upload DATETIME columns.

tswast self-assigned this and unassigned HemangChothani on Jan 2, 2020
tswast commented Jan 2, 2020

I investigated this further by capturing the Parquet file that is generated by _pandas_helpers.dataframe_to_parquet.

client.py

            if pyarrow and job_config.schema:
                if parquet_compression == "snappy":  # adjust the default value
                    parquet_compression = parquet_compression.upper()

                _pandas_helpers.dataframe_to_parquet(
                    dataframe,
                    job_config.schema,
                    tmppath,
                    parquet_compression=parquet_compression,
                )
                # Temporary debug hack: keep a copy of the generated Parquet file for inspection.
                import shutil
                shutil.copyfile(tmppath, "/Users/swast/Desktop/dataframe.parquet")
            else:

dataframe.parquet.zip

When I upload dataframe.parquet via the web interface, I'm not able to reproduce the issue. There must be something different with how we upload the data file from Python.

tswast commented Jan 2, 2020

I'm able to reproduce this with the command-line:

bq load --source_format=PARQUET --schema=wikidata_id:STRING,title:STRING,release_year:INTEGER,length_minutes:FLOAT,release_date:TIMESTAMP,dvd_release:DATETIME my_dataset.monty_python dataframe.parquet

tswast added the backend label on Jan 2, 2020
tswast commented Jan 2, 2020

Since I'm able to reproduce this error with the command-line, I believe it is a backend issue. I've filed internal bug 147108331 for the backend engineers to investigate.

tswast commented Jan 2, 2020

I was able to create a Parquet file with nanosecond-precision DATETIME columns, but the backend still multiplies the values by 1000. It seems we can't write DATETIME columns from Parquet without a backend fix.

dataframe_nanos.parquet.zip
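For context, a sketch of one way such a nanosecond-precision Parquet file can be written with pyarrow (not the exact file attached above; Parquet format version "2.6", called "2.0" in older pyarrow releases, is needed so the values aren't coerced to microseconds):

# Sketch: write a Parquet file that keeps nanosecond-precision timestamps.
import pandas as pd
import pyarrow
import pyarrow.parquet

df = pd.DataFrame({"dvd_release": pd.to_datetime(["2002-01-22 07:00:00.000000001"])})
table = pyarrow.Table.from_pandas(df, preserve_index=False)
pyarrow.parquet.write_table(table, "dataframe_nanos.parquet", version="2.6")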

tswast commented Jan 2, 2020

After all this, I think the best option is to write datetime64[ns] dtype columns as TIMESTAMP, since I'm not able to write correct DATETIME columns with Parquet files. I've updated my draft PR with this workaround.

tswast commented Jan 8, 2020

Talked with the backend engineers. DATETIME is not supported when uploading from Parquet (my request is being treated as a feature request), so changing the dtype mapping to TIMESTAMP is the correct solution.
