
Big Query Data Samples #948

Merged
merged 1 commit into from May 10, 2019

Conversation

@emtwo commented May 3, 2019

No description provided.

@emtwo emtwo requested a review from washort May 3, 2019 18:44
@jezdez jezdez requested review from jezdez and removed request for washort May 8, 2019 18:10
@jezdez left a comment

Some Python API design changes are needed, plus a bit more extensive use of the BigQuery API.

if column['type'] == 'RECORD':
    for field in column['fields']:
        columns.append(u"{}.{}".format(column['name'], field['name']))
        col_name = u"{}.{}".format(column['name'], field['name'])

Isn't that really a "field name" and not a "column name"?

else:
    columns.append(column['name'])
    metadata.append({'name': column['name'], 'type': column['type']})

return columns, metadata

I'm not a huge fan of extending a method by returning tuples of values that are related but not quite the same, since it makes type checking of the return value harder and encourages wrong data unpacking, or even more extended return types in the future.

And since it seems this method isn't used elsewhere, can we just make this a _get_column_metadata method that returns the list of metadata for the given column, including the name that _get_columns_schema needs?

In _get_columns_schema, the columns value could then be extended with columns.extend(map(operator.itemgetter('name'), metadatum)). What do you think?
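
A rough sketch of what that split could look like. Only the _get_column_metadata name and the columns.extend(...) line come from the comment above; the method bodies, and in particular the per-field type for RECORD columns, are assumptions based on the diff hunks shown:

import operator  # at module level

def _get_column_metadata(self, column):
    # Return a list of {'name', 'type'} dicts for a single schema column,
    # flattening RECORD columns into dotted field names.
    metadata = []
    if column['type'] == 'RECORD':
        for field in column['fields']:
            field_name = u"{}.{}".format(column['name'], field['name'])
            metadata.append({'name': field_name, 'type': field['type']})
    else:
        metadata.append({'name': column['name'], 'type': column['type']})
    return metadata

def _get_columns_schema(self, schema):
    columns = []
    metadata = []
    for column in schema:
        metadatum = self._get_column_metadata(column)
        columns.extend(map(operator.itemgetter('name'), metadatum))
        metadata.extend(metadatum)
    return columns, metadata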

@emtwo (Author)

I like this idea, thanks!

associated_sample = [] if len(samples[i]) == 0 else samples[i][0]

for j, field in enumerate(column['fields']):
    col_name = u"{}.{}".format(column['name'], field['name'])

More like a field_name than a column name here, right?

# the schema provided (i.e. their lengths should match up)
for i, column in enumerate(schema):
    if column['type'] == 'RECORD':
        associated_sample = samples[i]

Let's move this line into an else clause below.

for j, field in enumerate(column['fields']):
    col_name = u"{}.{}".format(column['name'], field['name'])
    samples_dict[col_name] = None
    if type(associated_sample) == list and len(associated_sample) > 0:

Hmmm, not happy about this type check here since it'll be expensive. Could we do this differently?


Oh, also use isinstance, like below.
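
i.e., something like this (a sketch; the plain truthiness check also covers the non-empty condition without calling len()):

if isinstance(associated_sample, list) and associated_sample: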

service = self._get_bigquery_service()
project_id = self._get_project_id()

dataset_id, table_id = table_name.split('.')

Is there a chance that the provided table name contains more than one dot? Providing a limiting second parameter, table_name.split('.', 1), could prevent accidental ValueErrors during unpacking.
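
For example, with a hypothetical table name containing two dots:

# Without a limit, splitting "my_dataset.events.v1" yields three items and
# the two-name unpacking raises ValueError; maxsplit=1 stops after the first dot:
dataset_id, table_id = "my_dataset.events.v1".split('.', 1)
# dataset_id == "my_dataset", table_id == "events.v1"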

tableId=table_id,
).execute()
table_rows = sample_response.get('rows', [])
samples = ({} if len(table_rows) == 0 else table_rows[0]).get('f', [])

Mind writing this not in shorthand? It's easier to debug that way.
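
For instance, spelled out (the first_row name is just for illustration):

if len(table_rows) == 0:
    first_row = {}
else:
    first_row = table_rows[0]
samples = first_row.get('f', [])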

datasetId=dataset_id,
tableId=table_id,
maxResults=1
).execute()

This needs to use the fields parameter for partial responses, and pagination, to be a functioning query call, like I did in https://github.com/getredash/redash/pull/3673/files#diff-79a49f870dc6fe9bd78c6c81e5d3b200R267.

The while loop does the pagination; you must request the nextPageToken to get it to work.

Docs for the "partial response": https://developers.google.com/api-client-library/python/guide/performance#partial-response-fields-parameter

API docs for tabledata listing: https://developers.google.com/apis-explorer/#p/bigquery/v2/bigquery.tabledata.list

API docs for table listing: https://developers.google.com/apis-explorer/#p/bigquery/v2/bigquery.tables.list
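
A minimal sketch of that pattern for tables().list(), assuming the service, project_id, and dataset_id values used elsewhere in this query runner:

tables = []
page_token = None
while True:
    response = service.tables().list(
        projectId=project_id,
        datasetId=dataset_id,
        pageToken=page_token,
        # partial response: request only the fields we actually need
        fields="nextPageToken,tables(tableReference/tableId)",
    ).execute()
    tables.extend(response.get('tables', []))
    page_token = response.get('nextPageToken')
    if page_token is None:
        break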

@emtwo (Author)

Hm, I've looked into this a bit and have a few comments on what works and what I think doesn't quite apply:

  • Adding fields to both of these API calls works well and makes sense. Thanks for pointing this out!
  • I'm actually not using tables().list() but tables().get() (https://developers.google.com/apis-explorer/#p/bigquery/v2/bigquery.tables.get), which doesn't return or use a nextPageToken; since it fetches only one table, it doesn't need to paginate.
  • In terms of my use of tabledata().list(), I've also added the parameter maxResults=1, since I'm only requesting one sample for a given table. For that reason, I didn't paginate the result. We could potentially add pagination as a precaution, but as far as I can tell it's not necessary. (Both calls are sketched below.)
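
For reference, a sketch of the two calls as described in the list above (parameter names per the BigQuery v2 API; service, project_id, dataset_id, and table_id are assumed from the surrounding code):

# tables().get() fetches a single table, so there is no page token to follow.
table_response = service.tables().get(
    projectId=project_id,
    datasetId=dataset_id,
    tableId=table_id,
    fields="schema",    # partial response: only the schema is needed
).execute()
schema = table_response.get('schema', {}).get('fields', [])

# Only one sample row is requested, so pagination isn't needed here either.
sample_response = service.tabledata().list(
    projectId=project_id,
    datasetId=dataset_id,
    tableId=table_id,
    fields="rows",      # partial response: only the rows
    maxResults=1,
).execute()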

@jezdez

Okay, that makes sense; apologies for misreading the API that's used here, my fault.

About pagination, the only risk I see is that if we removed the maxResults parameter in the future, we wouldn't notice that pagination is missing, since the code isn't expressive enough to show that. So either just add the while loop, or maybe add a comment? Not sure what makes more sense 😬

@emtwo (Author)

Agreed, I'll leave a comment on this.

    flattened_samples = self._flatten_samples(samples)
    samples_dict = self._columns_and_samples_to_dict(schema, flattened_samples)
    return samples_dict
except Exception as e:

Let's be more specific about which exceptions could happen here, so we don't accidentally hide things we aren't expecting. The logger.exception call doesn't do much right now since we don't look for it.
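
For instance (a sketch only; HttpError from google-api-python-client is the error emtwo mentions further down in this thread, the logger is assumed to be the module-level one, and the fallback return value is a guess):

from googleapiclient.errors import HttpError  # at module level

try:
    flattened_samples = self._flatten_samples(samples)
    samples_dict = self._columns_and_samples_to_dict(schema, flattened_samples)
    return samples_dict
except HttpError:
    logger.exception("Failed to fetch BigQuery table samples")
    return {}  # fall back to no samples (assumed behavior)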

}

with patch.object(BigQuery, '_get_bigquery_service') as get_bq_service:
    get_bq_service.return_value.tabledata.return_value \

Let's use parentheses instead of backslashes.
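
E.g., wrapping the chain in parentheses instead (a sketch; the rest of the mock chain and the sample_response name are assumptions, since the diff above is truncated):

with patch.object(BigQuery, '_get_bigquery_service') as get_bq_service:
    # Parentheses let the attribute chain span lines without backslashes.
    (
        get_bq_service.return_value
        .tabledata.return_value
        .list.return_value
        .execute
    ).return_value = sample_response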

@emtwo (Author) commented May 10, 2019

I wanted to make a note here so I don't forget: when processing some queries in the BQ samples locally, a lot of Access Denied errors came up. Before merging, we should check in with Jason on whether this was just because of my personal permissions and it would work fine otherwise.

@jezdez commented May 10, 2019

@emtwo Good idea. Were those Access Denied errors showing up in a place where the exception handling could catch them?

@jezdez commented May 10, 2019

FWIW, this is good to go with the caveat around pagination.

@emtwo (Author) commented May 10, 2019

@jezdez HttpError catches the access-denied issue, so it's not that the code can't handle it; rather, a bunch of samples end up not getting populated. I suppose we can just see how this does in staging.
