Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metadata.reflect() fails if user does not have access to all datasets/tables in project. #838

Closed
jlynchMicron opened this issue Feb 16, 2023 · 7 comments · May be fixed by #1089
Closed
Labels
api: bigquery Issues related to the googleapis/python-bigquery-sqlalchemy API. priority: p3 Desirable enhancement or fix. May not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@jlynchMicron
Copy link

Thanks for stopping by to let us know something could be better!

PLEASE READ: If you have a support contract with Google, please create an issue in the support console instead of filing on GitHub. This will ensure a timely response.

Please run down the following list and make sure you've tried the usual "quick fixes":

If you are still having issues, please be sure to include as much information as possible:

Filed this issue on main SQLalchemy Git project, but they think this is a BigQuery driver issue:
sqlalchemy/sqlalchemy#9319

Environment details

  • OS type and version: Linux CentOS 7
  • Python version: 3.10
  • sqlalchemy-bigquery version: 1.5.0

Steps to reproduce

Metadata.reflect function tries to get all table names in engine before using the "only" table list. This can cause issues if a user does not have read permissions on all tables/datasets under engine, and results in a permission denied error.

Check for all table names: https://github.com/sqlalchemy/sqlalchemy/blob/main/lib/sqlalchemy/sql/schema.py#L5487
Check for "only" tables after: https://github.com/sqlalchemy/sqlalchemy/blob/main/lib/sqlalchemy/sql/schema.py#L5507

Desired behavior is to only check on tables explicitly listed with "only" argument.

Code example

Simple example similar to my problem. Engine is configured to a certain BQ project. My "only" table list has multiple tables coming from different datasets withing the project. I also do not have permission to all tables/datasets within the project.

engine = sa.create_engine('bigquery://' + bq_project)
tbl_only_list = ['datasetA.tbl1', 'datasetB.tbl2']
metadata.reflect(engine, autoload_replace=False, extend_existing=True, only=tbl_only_list, views=True)

Stack trace

Partially redacted stack and error message.

Permission bigquery.tables.list denied on dataset ************* (or it may not exist).

  File "/home/jlynch/miniconda3/envs/telem_dev/lib/python3.10/site-packages/google/cloud/_http/__init__.py", line 494, in api_request
    raise exceptions.from_http_response(response)
  File "/home/jlynch/miniconda3/envs/telem_dev/lib/python3.10/site-packages/google/api_core/retry.py", line 191, in retry_target
    return target()
  File "/home/jlynch/miniconda3/envs/telem_dev/lib/python3.10/site-packages/google/api_core/retry.py", line 349, in retry_wrapped_func
    return retry_target(
  File "/home/jlynch/miniconda3/envs/telem_dev/lib/python3.10/site-packages/google/cloud/bigquery/client.py", line 789, in _call_api
    return call()
  File "/home/jlynch/miniconda3/envs/telem_dev/lib/python3.10/site-packages/google/cloud/bigquery/client.py", line 1532, in api_request
    return self._call_api(
  File "/home/jlynch/miniconda3/envs/telem_dev/lib/python3.10/site-packages/google/api_core/page_iterator.py", line 432, in _get_next_page_response
    return self.api_request(
  File "/home/jlynch/miniconda3/envs/telem_dev/lib/python3.10/site-packages/google/api_core/page_iterator.py", line 373, in _next_page
    response = self._get_next_page_response()
  File "/home/jlynch/miniconda3/envs/telem_dev/lib/python3.10/site-packages/google/api_core/page_iterator.py", line 244, in _page_iter
    page = self._next_page()
  File "/home/jlynch/miniconda3/envs/telem_dev/lib/python3.10/site-packages/google/api_core/page_iterator.py", line 208, in _items_iter
    for page in self._page_iter(increment=False):
  File "/home/jlynch/miniconda3/envs/telem_dev/lib/python3.10/site-packages/sqlalchemy_bigquery/base.py", line 857, in _get_table_or_view_names
    for table in tables:
  File "/home/jlynch/miniconda3/envs/telem_dev/lib/python3.10/site-packages/sqlalchemy_bigquery/base.py", line 1012, in get_table_names
    return self._get_table_or_view_names(connection, item_types, schema)
  File "/home/jlynch/miniconda3/envs/telem_dev/lib/python3.10/site-packages/sqlalchemy/engine/reflection.py", line 266, in get_table_names
    return self.dialect.get_table_names(
  File "/home/jlynch/miniconda3/envs/telem_dev/lib/python3.10/site-packages/sqlalchemy/sql/schema.py", line 4859, in reflect
    available = util.OrderedSet(insp.get_table_names(schema))

Making sure to follow these steps will guarantee the quickest resolution possible.

Thanks!

@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-sqlalchemy API. label Feb 16, 2023
@jlynchMicron
Copy link
Author

Please note, I can also confirm that the dataset redacted in the error message does exist, I just dont have permissions to it. This dataset is not described in my "only" table list, so I should not need to have permissions to this dataset.

@meredithslota meredithslota added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Mar 2, 2023
@chalmerlowe
Copy link
Collaborator

Based on my understanding of the code in sqlalchemy, it is sqlalchemy that should be doing the filtering to determine which tables to request for reflection.

We should only be getting a filtered list of tables that they request.

In the section above, there are several snippets of code that look like this one that basically parse the current list of all tables and then if extend_existing is True or if a given table is not already in the list of current tables will add certain table names to the load list. With the final result being: a list of filtered table names to send on for reflection.

load = [
    name
    for name in only
    if extend_existing or name not in current
]

It is not clear how our code gets an unfiltered list of tables but I recommend that you start by looking at the sqlalchemy code to ensure that their logic is correct and whether it really is sending out a filtered list.

Gonna move this down to a P3, unless we get feedback that sqlalchemy is performing as expected.

@chalmerlowe chalmerlowe added priority: p3 Desirable enhancement or fix. May not be included in next release. and removed priority: p2 Moderately-important priority. Fix may not be included in next release. labels Jun 16, 2023
@chalmerlowe
Copy link
Collaborator

In the absence of additional evidence that this is truly an issue with python-bigquery-sqlalchemy, gonna close this issue.

@ADR-007
Copy link

ADR-007 commented May 9, 2024

You can specify schema parameter, to read this schema only, but then it will fail because this bigquery driver will return <dataset_id>.<table_name> instead of just <table_name> and SQLAlchemy will add schema prefix one more time.

# MetaData.reflect(
# ...
            # insert this to fix:
            # available = util.OrderedSet({v.split('.', 1)[-1] for v in available})

            if schema is not None:
                available_w_schema: util.OrderedSet[str] = util.OrderedSet(
                    [f"{schema}.{name}" for name in available]
                )
            else:
                available_w_schema = available

But table <dataset_id>.<dataset_id>.<table_name> doesn't exists :)

What is bad on SqlAlchemy side - it forces dialect to load all table names in that schema.

@jlynchMicron
Copy link
Author

You can specify schema parameter, to read this schema only, but then it will fail because this bigquery driver will return <dataset_id>.<table_name> instead of just <table_name> and SQLAlchemy will add schema prefix one more time.

# MetaData.reflect(
# ...
            # insert this to fix:
            # available = util.OrderedSet({v.split('.', 1)[-1] for v in available})

            if schema is not None:
                available_w_schema: util.OrderedSet[str] = util.OrderedSet(
                    [f"{schema}.{name}" for name in available]
                )
            else:
                available_w_schema = available

But table <dataset_id>.<dataset_id>.<table_name> doesn't exists :)

What is bad on SqlAlchemy side - it forces dialect to load all table names in that schema.

Yeah, I tried playing around with the use of schema about a month ago and I could not get it working with a <projectt_id>.<dataset_id> to prefix <table_name> at all.

@JacobHayes
Copy link
Contributor

JacobHayes commented Jun 27, 2024

You can specify schema parameter, to read this schema only, but then it will fail because this bigquery driver will return <dataset_id>.<table_name> instead of just <table_name> and SQLAlchemy will add schema prefix one more time.

The BigQuery dialect is actually violating the interface here I think - all of the other dialects return only the table name from get_table_names and get_view_names, which is why SQLAlchemy adds the schema.

schema=None to get_{table,view}_names is supposed to mean "search the default schema", not "search all schemas", but the BigQuery dialect searches all (likely because there isn't always a "default" schema, if it's not set on the connection url).

This part is a slightly separate from the OP though, so I'll open another issue for it.


This also wreaks havoc when using alembic revision --autogenerate without a default dataset (ie: schema is set on table object). The python models have a proper schema table arg and a bare table name, but the reflected ones end up with both a schema arg and a {schema}.{table} table name (try with a breakpoint() just above here).

@JacobHayes
Copy link
Contributor

JacobHayes commented Jun 27, 2024

In the absence of additional evidence that this is truly an issue with python-bigquery-sqlalchemy, gonna close this issue.

The issue is that the BigQueryDialect.get_{table,view}_names methods call list_dataset in the _get_table_or_view_names helper:

def _get_table_or_view_names(self, connection, item_types, schema=None):
current_schema = schema or self.dataset_id
get_table_name = (
self._build_formatted_table_id
if self.dataset_id is None
else operator.attrgetter("table_id")
)
client = connection.connection._client
datasets = client.list_datasets()

Those methods are only supposed to operate on a single schema at a time iiuc, so this should be fixed in #1089, but the OP would need to still call .reflect and set schema= for each one allowed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery-sqlalchemy API. priority: p3 Desirable enhancement or fix. May not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants