BigQuery: Fix bug where `load_table_from_dataframe` could not append to REQUIRED fields.
#8230
Conversation
Fix bug where `load_table_from_dataframe` could not append to REQUIRED fields. If a BigQuery schema is supplied as part of the `job_config`, it can be used to set the `nullable` bit correctly on the serialized parquet file.
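As a rough illustration of the idea behind the fix (a hypothetical sketch, not the library's actual code): BigQuery describes each field with a mode, and only `REQUIRED` fields should map to a non-nullable column in the serialized parquet file.

```python
# Hypothetical helper illustrating the mode-to-nullable mapping; the real fix
# lives inside google-cloud-bigquery's pandas/parquet helpers, not here.
def bq_mode_to_nullable(mode):
    """BigQuery field modes are NULLABLE, REQUIRED, or REPEATED; only
    REQUIRED columns should be written as non-nullable in the parquet schema."""
    return mode.upper() != "REQUIRED"

print(bq_mode_to_nullable("REQUIRED"))  # False: must be non-nullable
print(bq_mode_to_nullable("NULLABLE"))  # True
```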
Update: I figured out that the example in the issue description does not hit the `to_parquet()` line, because `job_config.schema` is `None`. Will try to figure out how to set that.
(Disclaimer: my BQ knowledge is very limited.)
Non-essential remark aside, the code changes look good to me overall. I had some trouble verifying the fix, though.
I was able to reproduce the issue following the steps from the description (I had to switch "foo" and "bar" in the second-to-last line). When testing again on the PR branch, however, the issue persisted and I got the same error.
What could I be missing?
FWIW, I did make sure to re-install the bigquery library after pulling the PR code:

```
(venv-3.6) peter@black-box:~/workspace/google-cloud-python/bigquery (pr_temp)$ pip install -e .
```
```python
arrow_names.append(bq_field.name)
arrow_arrays.append(bq_to_arrow_array(dataframe[bq_field.name], bq_field))

arrow_table = pyarrow.Table.from_arrays(arrow_arrays, names=arrow_names)
if all((field is not None for field in arrow_fields)):
```
(minor)
As a sole argument, the generator expression does not have to be enclosed in an extra pair of parentheses.
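A quick illustration of that point (plain Python, independent of this PR's code): when a generator expression is the sole argument to a call, the call's own parentheses are sufficient, so the extra pair is redundant.

```python
fields = ["name", None, "age"]

# Extra parentheses around the sole-argument generator expression:
with_parens = all((field is not None for field in fields))
# Equivalent, without the redundant pair:
without_parens = all(field is not None for field in fields)

assert with_parens == without_parens  # both are False here, since one field is None
```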
Update 2: I changed the last line of the example from the issue description to the following:

```python
from google.cloud.bigquery import job

job_config = job.LoadJobConfig(schema=schema)
client.load_table_from_dataframe(
    df, table_ref, job_config=job_config
).result()
```
The error I then got was different, but seemed similar to the original one:

```
google.api_core.exceptions.BadRequest: 400 Error while reading data, error message: Provided schema is not compatible with the file 'prod-scotty-8efadb65-d51b-44ba-bfec-cf98d1e93934'. Field 'bar' is specified as REQUIRED in provided schema which does not match NULLABLE as specified in the file.
```
When I ran the modified example with the PR fix, the error disappeared. Seems like the fix works (and the new code path was indeed taken).
Based on my limited BQ knowledge, the fix seems to work and the code looks good, but I will hold off on merging, since @shollyman might have something more to add. (If not, then please feel free to go ahead and merge it.)
Thanks for this.
If a BigQuery schema is supplied as part of the `job_config`, it can be used to set the `nullable` bit correctly on the serialized parquet file.
Closes #8093.