BUG: resolve divide by 0 error when uploading empty dataframe #252

Merged · 11 commits · Feb 26, 2019
4 changes: 3 additions & 1 deletion docs/source/changelog.rst
@@ -6,6 +6,8 @@ Changelog
0.10.0 / TBD
------------

- This fixes a bug where pandas-gbq could not upload an empty DataFrame. (:issue:`237`)

Dependency updates
~~~~~~~~~~~~~~~~~~

@@ -235,4 +237,4 @@ Initial release of transferred code from `pandas <https://github.com/pandas-dev/p
Includes patches since the 0.19.2 release on pandas with the following:

- :func:`read_gbq` now allows query configuration preferences `pandas-GH#14742 <https://github.com/pandas-dev/pandas/pull/14742>`__
- :func:`read_gbq` now stores ``INTEGER`` columns as ``dtype=object`` if they contain ``NULL`` values. Otherwise they are stored as ``int64``. This prevents precision loss for integers greater than 2**53. Furthermore ``FLOAT`` columns with values above 10**4 are no longer cast to ``int64``, which also caused precision loss `pandas-GH#14064 <https://github.com/pandas-dev/pandas/pull/14742>`__, and `pandas-GH#14305 <https://github.com/pandas-dev/pandas/pull/14305>`__
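
For context, a minimal reproduction of the bug being fixed. The destination table and project below are hypothetical placeholders, not values from the issue; before this change, uploading a zero-row DataFrame failed inside the progress logging:

import pandas
import pandas_gbq

# A DataFrame with a declared column but no rows.
empty_df = pandas.DataFrame({"my_col": pandas.Series([], dtype="int64")})

# Prior to this fix, load_data computed "Load is {}% Complete" by dividing by
# the total row count (0 here), so this call raised ZeroDivisionError.
pandas_gbq.to_gbq(
    empty_df,
    "my_dataset.my_table",       # hypothetical destination table
    project_id="my-project-id",  # hypothetical GCP project
)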
4 changes: 2 additions & 2 deletions pandas_gbq/gbq.py
@@ -518,8 +518,8 @@ def load_data(
                 chunks = tqdm.tqdm(chunks)
             for remaining_rows in chunks:
                 logger.info(
-                    "\rLoad is {0}% Complete".format(
-                        ((total_rows - remaining_rows) * 100) / total_rows
+                    "\r{} out of {} rows loaded.".format(
+                        total_rows - remaining_rows, total_rows
                     )
                 )
         except self.http_error as ex:
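To spell out the failure this hunk fixes (a standalone sketch with illustrative values, not library code): the old progress message divided by the total row count, which is zero for an empty DataFrame, while the new message only reports the two counts and never divides.

# Illustrative values for an empty DataFrame; not taken from the library.
total_rows = 0
remaining_rows = 0

# Old message: the percentage calculation divides by total_rows, so an empty
# DataFrame raised ZeroDivisionError before this change.
try:
    old_message = "\rLoad is {0}% Complete".format(
        ((total_rows - remaining_rows) * 100) / total_rows
    )
except ZeroDivisionError:
    old_message = None  # this is the crash reported in issue 237

# New message: plain counts, no division, so zero rows formats cleanly.
new_message = "\r{} out of {} rows loaded.".format(
    total_rows - remaining_rows, total_rows
)
print(new_message)  # -> "0 out of 0 rows loaded."
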
22 changes: 22 additions & 0 deletions tests/system/test_gbq.py
@@ -924,6 +924,28 @@ def test_upload_data(self, project_id):
        )
        assert result["num_rows"][0] == test_size

    def test_upload_empty_data(self, project_id):
        test_id = "data_with_0_rows"
        test_size = 0
        df = DataFrame()
Collaborator review comment:
Looks like we might have an additional problem when the DataFrame contains no columns.

In the conda build (https://circleci.com/gh/tswast/pandas-gbq/276) I'm getting:

E           google.api_core.exceptions.BadRequest: 400 POST https://www.googleapis.com/upload/bigquery/v2/projects/pandas-gbq-tests/jobs?uploadType=resumable: Empty schema specified for the load job. Please specify a schema that describes the data being loaded.

Since we still create a table in pandas-gbq before running the load job, we can probably avoid doing the load job altogether when a DataFrame contains no rows (a sketch of this idea follows the diff below).


        gbq.to_gbq(
            df,
            self.destination_table + test_id,
            project_id,
            credentials=self.credentials,
        )

        result = gbq.read_gbq(
            "SELECT COUNT(*) AS num_rows FROM {0}".format(
                self.destination_table + test_id
            ),
            project_id=project_id,
            credentials=self.credentials,
            dialect="legacy",
        )
        assert result["num_rows"][0] == test_size

    def test_upload_data_if_table_exists_fail(self, project_id):
        test_id = "2"
        test_size = 10
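
A rough sketch of the collaborator's suggestion above: always create the destination table, but skip the BigQuery load job when the DataFrame has no rows. This is illustrative only and not pandas-gbq's implementation; `create_table` and `run_load_job` are hypothetical callables standing in for the real table-creation and load steps.

import pandas


def upload(dataframe, create_table, run_load_job):
    """Sketch of the suggested flow: create the table, then only run the
    load job when the DataFrame actually has rows."""
    create_table(dataframe)
    if len(dataframe) == 0:
        # Skip the load job: an empty DataFrame has nothing to stream, and
        # submitting a job anyway can fail with "Empty schema specified for
        # the load job" when there are no columns either.
        return
    run_load_job(dataframe)


# Usage with stub callables and a zero-row DataFrame:
calls = []
upload(
    pandas.DataFrame(),
    create_table=lambda df: calls.append("create_table"),
    run_load_job=lambda df: calls.append("load_job"),
)
assert calls == ["create_table"]  # the load job was skipped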