
Big Query Data Samples #948

Merged
merged 1 commit into from May 10, 2019

Conversation

@emtwo commented May 3, 2019

No description provided.

@emtwo emtwo requested a review from washort May 3, 2019 18:44
@jezdez jezdez requested review from jezdez and removed request for washort May 8, 2019 18:10
@jezdez left a comment

Some Python API design changes are needed, plus a bit more extensive use of the BigQuery API.

if column['type'] == 'RECORD':
    for field in column['fields']:
        columns.append(u"{}.{}".format(column['name'], field['name']))
        col_name = u"{}.{}".format(column['name'], field['name'])

Isn't that really a "field name" and not a "column name"?

else:
    columns.append(column['name'])
    metadata.append({'name': column['name'], 'type': column['type']})

return columns, metadata

I'm not a huge fan of extending a method by returning tuples of values that are related but not quite the same, since it makes type checking of the return value harder and encourages wrong data unpacking, or even more extended return types in the future.

And since it seems this method isn't used elsewhere, can we just make this a _get_column_metadata method that returns the list of metadata for the given column, including the name that _get_columns_schema needs?

In _get_columns_schema, the columns value could then be extended with columns.extend(map(operator.itemgetter('name'), metadatum)). What do you think?
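
A rough sketch of what that split could look like. Only the _get_column_metadata name and the columns.extend(...) line come from the comment above; the method bodies, and in particular the per-field type for RECORD columns, are assumptions based on the diff hunks shown:

import operator  # at module level

def _get_column_metadata(self, column):
    # Return a list of {'name', 'type'} dicts for a single schema column,
    # flattening RECORD columns into dotted field names.
    metadata = []
    if column['type'] == 'RECORD':
        for field in column['fields']:
            field_name = u"{}.{}".format(column['name'], field['name'])
            metadata.append({'name': field_name, 'type': field['type']})
    else:
        metadata.append({'name': column['name'], 'type': column['type']})
    return metadata

def _get_columns_schema(self, schema):
    columns = []
    metadata = []
    for column in schema:
        metadatum = self._get_column_metadata(column)
        columns.extend(map(operator.itemgetter('name'), metadatum))
        metadata.extend(metadatum)
    return columns, metadata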

@emtwo (Author)

I like this idea, thanks!

associated_sample = [] if len(samples[i]) == 0 else samples[i][0]

for j, field in enumerate(column['fields']):
    col_name = u"{}.{}".format(column['name'], field['name'])

More like a field_name than a column name here, right?

# the schema provided (i.e. their lengths should match up)
for i, column in enumerate(schema):
    if column['type'] == 'RECORD':
        associated_sample = samples[i]

Let's move this line into an else clause below.

for j, field in enumerate(column['fields']):
    col_name = u"{}.{}".format(column['name'], field['name'])
    samples_dict[col_name] = None
    if type(associated_sample) == list and len(associated_sample) > 0:

Hmmm, not happy about this type check here since it'll be expensive. Could we do this differently?


Oh, also use isinstance, like below.
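
i.e., something like this (a sketch; the plain truthiness check also covers the non-empty condition without calling len()):

if isinstance(associated_sample, list) and associated_sample: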

service = self._get_bigquery_service()
project_id = self._get_project_id()

dataset_id, table_id = table_name.split('.')

Is there a chance that the provided table name contains more than one dot? Providing a limiting second parameter, table_name.split('.', 1), could prevent accidental ValueErrors during unpacking.
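
For example, with a hypothetical table name containing two dots:

# Without a limit, splitting "my_dataset.events.v1" yields three items and
# the two-name unpacking raises ValueError; maxsplit=1 stops after the first dot:
dataset_id, table_id = "my_dataset.events.v1".split('.', 1)
# dataset_id == "my_dataset", table_id == "events.v1"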

tableId=table_id,
).execute()
table_rows = sample_response.get('rows', [])
samples = ({} if len(table_rows) == 0 else table_rows[0]).get('f', [])

Mind writing this not in shorthand? It's easier to debug that way.
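
For instance, spelled out (the first_row name is just for illustration):

if len(table_rows) == 0:
    first_row = {}
else:
    first_row = table_rows[0]
samples = first_row.get('f', [])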

datasetId=dataset_id,
tableId=table_id,
maxResults=1
).execute()

This needs to use the fields parameter for partial responses, and pagination, to be a functioning query call, like I did in https://github.com/getredash/redash/pull/3673/files#diff-79a49f870dc6fe9bd78c6c81e5d3b200R267.

The while loop does the pagination; you must request the nextPageToken to get it to work.

Docs for the "partial response": https://developers.google.com/api-client-library/python/guide/performance#partial-response-fields-parameter

API docs for tabledata listing: https://developers.google.com/apis-explorer/#p/bigquery/v2/bigquery.tabledata.list

API docs for table listing: https://developers.google.com/apis-explorer/#p/bigquery/v2/bigquery.tables.list
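
A minimal sketch of that pattern for tables().list(), assuming the service, project_id, and dataset_id values used elsewhere in this query runner:

tables = []
page_token = None
while True:
    response = service.tables().list(
        projectId=project_id,
        datasetId=dataset_id,
        pageToken=page_token,
        # partial response: request only the fields we actually need
        fields="nextPageToken,tables(tableReference/tableId)",
    ).execute()
    tables.extend(response.get('tables', []))
    page_token = response.get('nextPageToken')
    if page_token is None:
        break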

@emtwo (Author)

Hm, I've looked into this a bit and have a few comments on what works and what I think doesn't quite apply:

  • Adding fields to both of these API calls works well and makes sense. Thanks for pointing this out!
  • I'm actually not using tables().list() but tables().get() (https://developers.google.com/apis-explorer/#p/bigquery/v2/bigquery.tables.get), which doesn't return or use a nextPageToken; since it fetches only one table, it doesn't need to paginate.
  • In terms of my use of tabledata().list(), I've also added the parameter maxResults=1, since I'm only requesting one sample for a given table. For that reason, I didn't paginate the result. We could potentially add pagination as a precaution, but as far as I can tell it's not necessary. (Both calls are sketched below.)
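
For reference, a sketch of the two calls as described in the list above (parameter names per the BigQuery v2 API; service, project_id, dataset_id, and table_id are assumed from the surrounding code):

# tables().get() fetches a single table, so there is no page token to follow.
table_response = service.tables().get(
    projectId=project_id,
    datasetId=dataset_id,
    tableId=table_id,
    fields="schema",    # partial response: only the schema is needed
).execute()
schema = table_response.get('schema', {}).get('fields', [])

# Only one sample row is requested, so pagination isn't needed here either.
sample_response = service.tabledata().list(
    projectId=project_id,
    datasetId=dataset_id,
    tableId=table_id,
    fields="rows",      # partial response: only the rows
    maxResults=1,
).execute()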

@jezdez

Okay, that makes sense; apologies for misreading the API that's used here, my fault.

About pagination, the only risk I see is that if we removed the maxResults parameter in the future, we wouldn't notice that pagination is missing, since the code isn't expressive enough to show that. So either just add the while loop, or maybe add a comment? Not sure what makes more sense 😬

@emtwo (Author)

Agreed, I'll leave a comment on this.

    flattened_samples = self._flatten_samples(samples)
    samples_dict = self._columns_and_samples_to_dict(schema, flattened_samples)
    return samples_dict
except Exception as e:

Let's be more specific about which exceptions could happen here, so we don't accidentally hide things we aren't expecting. The logger.exception call doesn't do much right now since we don't look for it.
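
For instance (a sketch only; HttpError from google-api-python-client is the error emtwo mentions further down in this thread, the logger is assumed to be the module-level one, and the fallback return value is a guess):

from googleapiclient.errors import HttpError  # at module level

try:
    flattened_samples = self._flatten_samples(samples)
    samples_dict = self._columns_and_samples_to_dict(schema, flattened_samples)
    return samples_dict
except HttpError:
    logger.exception("Failed to fetch BigQuery table samples")
    return {}  # fall back to no samples (assumed behavior)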

}

with patch.object(BigQuery, '_get_bigquery_service') as get_bq_service:
    get_bq_service.return_value.tabledata.return_value \

Let's use parentheses instead of backslashes.
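
E.g., wrapping the chain in parentheses instead (a sketch; the rest of the mock chain and the sample_response name are assumptions, since the diff above is truncated):

with patch.object(BigQuery, '_get_bigquery_service') as get_bq_service:
    # Parentheses let the attribute chain span lines without backslashes.
    (
        get_bq_service.return_value
        .tabledata.return_value
        .list.return_value
        .execute
    ).return_value = sample_response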

@emtwo (Author) commented May 10, 2019

I wanted to make a note here so I don't forget: when processing some queries in the BQ samples locally, a lot of Access Denied errors came up. Before merging, we should check in with Jason on whether this was just because of my personal permissions and it would work fine otherwise.

@jezdez commented May 10, 2019

@emtwo Good idea. Were those Access Denied errors showing up in a place where the exception handling could catch them?

@jezdez commented May 10, 2019

FWIW, this is good to go with the caveat around pagination.

@emtwo (Author) commented May 10, 2019

@jezdez HttpError catches the access-denied issue, so it's not that the code can't handle it; rather, a bunch of samples end up not getting populated. I suppose we can just see how this does in staging.
