Metadata.reflect() fails if user does not have access to all datasets/tables in project. #838

jlynchMicron · 2023-02-16T16:59:39Z

Thanks for stopping by to let us know something could be better!

PLEASE READ: If you have a support contract with Google, please create an issue in the support console instead of filing on GitHub. This will ensure a timely response.

Please run down the following list and make sure you've tried the usual "quick fixes":

Search the issues already opened: https://github.com/googleapis/python-bigquery-sqlalchemy/issues
Search StackOverflow: https://stackoverflow.com/questions/tagged/google-cloud-platform+python

If you are still having issues, please be sure to include as much information as possible:

Filed this issue on main SQLalchemy Git project, but they think this is a BigQuery driver issue:
sqlalchemy/sqlalchemy#9319

Environment details

OS type and version: Linux CentOS 7
Python version: 3.10
sqlalchemy-bigquery version: 1.5.0

Steps to reproduce

Metadata.reflect function tries to get all table names in engine before using the "only" table list. This can cause issues if a user does not have read permissions on all tables/datasets under engine, and results in a permission denied error.

Check for all table names: https://github.com/sqlalchemy/sqlalchemy/blob/main/lib/sqlalchemy/sql/schema.py#L5487
Check for "only" tables after: https://github.com/sqlalchemy/sqlalchemy/blob/main/lib/sqlalchemy/sql/schema.py#L5507

Desired behavior is to only check on tables explicitly listed with "only" argument.

Code example

Simple example similar to my problem. Engine is configured to a certain BQ project. My "only" table list has multiple tables coming from different datasets withing the project. I also do not have permission to all tables/datasets within the project.

engine = sa.create_engine('bigquery://' + bq_project)
tbl_only_list = ['datasetA.tbl1', 'datasetB.tbl2']
metadata.reflect(engine, autoload_replace=False, extend_existing=True, only=tbl_only_list, views=True)

Stack trace

Partially redacted stack and error message.

Permission bigquery.tables.list denied on dataset ************* (or it may not exist).

  File "/home/jlynch/miniconda3/envs/telem_dev/lib/python3.10/site-packages/google/cloud/_http/__init__.py", line 494, in api_request
    raise exceptions.from_http_response(response)
  File "/home/jlynch/miniconda3/envs/telem_dev/lib/python3.10/site-packages/google/api_core/retry.py", line 191, in retry_target
    return target()
  File "/home/jlynch/miniconda3/envs/telem_dev/lib/python3.10/site-packages/google/api_core/retry.py", line 349, in retry_wrapped_func
    return retry_target(
  File "/home/jlynch/miniconda3/envs/telem_dev/lib/python3.10/site-packages/google/cloud/bigquery/client.py", line 789, in _call_api
    return call()
  File "/home/jlynch/miniconda3/envs/telem_dev/lib/python3.10/site-packages/google/cloud/bigquery/client.py", line 1532, in api_request
    return self._call_api(
  File "/home/jlynch/miniconda3/envs/telem_dev/lib/python3.10/site-packages/google/api_core/page_iterator.py", line 432, in _get_next_page_response
    return self.api_request(
  File "/home/jlynch/miniconda3/envs/telem_dev/lib/python3.10/site-packages/google/api_core/page_iterator.py", line 373, in _next_page
    response = self._get_next_page_response()
  File "/home/jlynch/miniconda3/envs/telem_dev/lib/python3.10/site-packages/google/api_core/page_iterator.py", line 244, in _page_iter
    page = self._next_page()
  File "/home/jlynch/miniconda3/envs/telem_dev/lib/python3.10/site-packages/google/api_core/page_iterator.py", line 208, in _items_iter
    for page in self._page_iter(increment=False):
  File "/home/jlynch/miniconda3/envs/telem_dev/lib/python3.10/site-packages/sqlalchemy_bigquery/base.py", line 857, in _get_table_or_view_names
    for table in tables:
  File "/home/jlynch/miniconda3/envs/telem_dev/lib/python3.10/site-packages/sqlalchemy_bigquery/base.py", line 1012, in get_table_names
    return self._get_table_or_view_names(connection, item_types, schema)
  File "/home/jlynch/miniconda3/envs/telem_dev/lib/python3.10/site-packages/sqlalchemy/engine/reflection.py", line 266, in get_table_names
    return self.dialect.get_table_names(
  File "/home/jlynch/miniconda3/envs/telem_dev/lib/python3.10/site-packages/sqlalchemy/sql/schema.py", line 4859, in reflect
    available = util.OrderedSet(insp.get_table_names(schema))

Making sure to follow these steps will guarantee the quickest resolution possible.

Thanks!

The text was updated successfully, but these errors were encountered:

jlynchMicron · 2023-02-16T17:03:24Z

Please note, I can also confirm that the dataset redacted in the error message does exist, I just dont have permissions to it. This dataset is not described in my "only" table list, so I should not need to have permissions to this dataset.

chalmerlowe · 2023-06-16T12:59:26Z

Based on my understanding of the code in sqlalchemy, it is sqlalchemy that should be doing the filtering to determine which tables to request for reflection.

We should only be getting a filtered list of tables that they request.

In the section above, there are several snippets of code that look like this one that basically parse the current list of all tables and then if extend_existing is True or if a given table is not already in the list of current tables will add certain table names to the load list. With the final result being: a list of filtered table names to send on for reflection.

load = [
    name
    for name in only
    if extend_existing or name not in current
]

It is not clear how our code gets an unfiltered list of tables but I recommend that you start by looking at the sqlalchemy code to ensure that their logic is correct and whether it really is sending out a filtered list.

Gonna move this down to a P3, unless we get feedback that sqlalchemy is performing as expected.

chalmerlowe · 2023-06-23T11:00:00Z

In the absence of additional evidence that this is truly an issue with python-bigquery-sqlalchemy, gonna close this issue.

ADR-007 · 2024-05-09T19:44:05Z

You can specify schema parameter, to read this schema only, but then it will fail because this bigquery driver will return <dataset_id>.<table_name> instead of just <table_name> and SQLAlchemy will add schema prefix one more time.

# MetaData.reflect(
# ...
            # insert this to fix:
            # available = util.OrderedSet({v.split('.', 1)[-1] for v in available})

            if schema is not None:
                available_w_schema: util.OrderedSet[str] = util.OrderedSet(
                    [f"{schema}.{name}" for name in available]
                )
            else:
                available_w_schema = available

But table <dataset_id>.<dataset_id>.<table_name> doesn't exists :)

What is bad on SqlAlchemy side - it forces dialect to load all table names in that schema.

jlynchMicron · 2024-05-10T16:49:28Z

You can specify schema parameter, to read this schema only, but then it will fail because this bigquery driver will return <dataset_id>.<table_name> instead of just <table_name> and SQLAlchemy will add schema prefix one more time.
# MetaData.reflect(
# ...
            # insert this to fix:
            # available = util.OrderedSet({v.split('.', 1)[-1] for v in available})

            if schema is not None:
                available_w_schema: util.OrderedSet[str] = util.OrderedSet(
                    [f"{schema}.{name}" for name in available]
                )
            else:
                available_w_schema = available
But table <dataset_id>.<dataset_id>.<table_name> doesn't exists :)

What is bad on SqlAlchemy side - it forces dialect to load all table names in that schema.

Yeah, I tried playing around with the use of schema about a month ago and I could not get it working with a <projectt_id>.<dataset_id> to prefix <table_name> at all.

JacobHayes · 2024-06-27T02:56:23Z

You can specify schema parameter, to read this schema only, but then it will fail because this bigquery driver will return <dataset_id>.<table_name> instead of just <table_name> and SQLAlchemy will add schema prefix one more time.

The BigQuery dialect is actually violating the interface here I think - all of the other dialects return only the table name from get_table_names and get_view_names, which is why SQLAlchemy adds the schema.

schema=None to get_{table,view}_names is supposed to mean "search the default schema", not "search all schemas", but the BigQuery dialect searches all (likely because there isn't always a "default" schema, if it's not set on the connection url).

This part is a slightly separate from the OP though, so I'll open another issue for it.

This also wreaks havoc when using alembic revision --autogenerate without a default dataset (ie: schema is set on table object). The python models have a proper schema table arg and a bare table name, but the reflected ones end up with both a schema arg and a {schema}.{table} table name (try with a breakpoint() just above here).

JacobHayes · 2024-06-27T03:44:55Z

In the absence of additional evidence that this is truly an issue with python-bigquery-sqlalchemy, gonna close this issue.

The issue is that the BigQueryDialect.get_{table,view}_names methods call list_dataset in the _get_table_or_view_names helper:

python-bigquery-sqlalchemy/sqlalchemy_bigquery/base.py

Lines 1103 to 1112 in 155e5b0

    
           def _get_table_or_view_names(self, connection, item_types, schema=None): 
        
               current_schema = schema or self.dataset_id 
        
               get_table_name = ( 
        
                   self._build_formatted_table_id 
        
                   if self.dataset_id is None 
        
                   else operator.attrgetter("table_id") 
        
               ) 
        
               client = connection.connection._client 
        
               datasets = client.list_datasets()

Those methods are only supposed to operate on a single schema at a time iiuc, so this should be fixed in #1089, but the OP would need to still call .reflect and set schema= for each one allowed.

product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-sqlalchemy API. label Feb 16, 2023

meredithslota added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Mar 2, 2023

chalmerlowe added priority: p3 Desirable enhancement or fix. May not be included in next release. and removed priority: p2 Moderately-important priority. Fix may not be included in next release. labels Jun 16, 2023

chalmerlowe closed this as completed Jun 23, 2023

This was referenced Jun 27, 2024

Buggy metadata reflection without a default dataset #1088

Open

fix: Fix metadata reflection without a default dataset #1089

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Metadata.reflect() fails if user does not have access to all datasets/tables in project. #838

Metadata.reflect() fails if user does not have access to all datasets/tables in project. #838

jlynchMicron commented Feb 16, 2023

jlynchMicron commented Feb 16, 2023

chalmerlowe commented Jun 16, 2023

chalmerlowe commented Jun 23, 2023

ADR-007 commented May 9, 2024 •

edited

Loading

jlynchMicron commented May 10, 2024

JacobHayes commented Jun 27, 2024 •

edited

Loading

JacobHayes commented Jun 27, 2024 •

edited

Loading

Metadata.reflect() fails if user does not have access to all datasets/tables in project. #838

Metadata.reflect() fails if user does not have access to all datasets/tables in project. #838

Comments

jlynchMicron commented Feb 16, 2023

Environment details

Steps to reproduce

Code example

Stack trace

jlynchMicron commented Feb 16, 2023

chalmerlowe commented Jun 16, 2023

chalmerlowe commented Jun 23, 2023

ADR-007 commented May 9, 2024 • edited Loading

jlynchMicron commented May 10, 2024

JacobHayes commented Jun 27, 2024 • edited Loading

JacobHayes commented Jun 27, 2024 • edited Loading

ADR-007 commented May 9, 2024 •

edited

Loading

JacobHayes commented Jun 27, 2024 •

edited

Loading

JacobHayes commented Jun 27, 2024 •

edited

Loading