pandas-gbq auth proposal #161

tswast · 2018-04-07T00:41:31Z

Overview

The current auth flows for pandas-gbq are a bit confusing and hard to customize.

Final desired state. The pandas_gbq module should have the following (changes in bold):

read_gbq(query, project_id [optional], index_col=None, col_order=None, reauth, verbose [deprecated], private_key [deprecated], auth_local_webserver, dialect='legacy', configuration [optional], credentials [new param, optional])
to_gbq(dataframe, destination_table, project_id [optional], chunksize=None, verbose [deprecated], reauth, if_exists='fail', private_key [deprecated], auth_local_webserver, table_schema=None, credentials [new param, optional])
CredentialsCache (and WriteOnlyCredentialsCache, NoopCredentialsCache) - new class (and subclasses) for configuring user credentials caching behavior
context - global singleton with "client" property for caching default client in-memory.
get_user_credentials(scopes=None, credentials_cache=None, client_secrets=None, use_localhost_webserver=False) - Helper function to get user authentication credentials.

Tasks:

Add authentication documentation with examples.
Add optional credentials parameter to read_gbq, taking a google.cloud.bigquery.Client object.
Add optional credentials parameter to to_gbq, taking a google.cloud.bigquery.Client object.
Add pandas_gbq.get_user_credentials() helper for fetching user credentials with installed-app OAuth2 flow.
Add pandas_gbq.CredentialsCache and related subclasses for managing user credentials cache.
Add pandas_gbq.context global for caching a default Client in-memory. Add examples for manually setting pandas_gbq.context.client (so that default project and other values like location can be set).
Update minimum google-cloud-bigquery version to 0.32.0 so that the project ID in the client can be overridden when creating query & load jobs. (Done in ENH: Add location parameter to read_gbq and to_gbq #185)
Deprecate private_key argument. Show examples of how to do the same thing by passing Credentials to the Client constructor.
Deprecate PANDAS_GBQ_CREDENTIALS_FILE environment variable. Show example using pandas_gbq.get_user_credentials with credentials_cache argument.
* [ ] Deprecate reauth argument. Show examples using pandas_gbq.get_user_credentials with credentials_cache argument and WriteOnlyCredentialsCache or NoopCredentialsCache. Edit: No reason to deprecate reauth, since we don't need to complicate pandas-gbq's auth with pydata-google-auth's implementation details.
* [ ] Deprecate auth_local_webserver argument. Show example using pandas_gbq.get_user_credentials with auth_local_webserver argument. Edit: No reason to deprecate auth_local_webserver, as that feature is still needed. We don't actually want to force people to use pydata-google-auth for the default credentials case.

Background

pandas-gbq has its own auth flows, which include but are distinct from "application default credentials".

See issue: #129

Current (0.4.0) state of pandas-gbq auth:

Use service account key file passed in as private_key parameter. The parameter can be either JSON bytes or a file path.
Use application default credentials.
1. Use service account key at GOOGLE_APPLICATION_CREDENTIALS environment variable.
2. Use service account associated with Compute Engine, Kubernetes Engine, App Engine, or Cloud Functions.
Use user authentication.
1. Attempt to load user credentials from cache stored at ~/.config/pandas_gbq/bigquery_credentials.dat or in path specified by PANDAS_GBQ_CREDENTIALS_FILE environment variable.
2. Do 3-legged OAuth flow.
3. Cache the user credentials to disk.
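
The lookup order above can be read as a chain of fallbacks. The following is only a rough sketch of that ordering, not pandas-gbq's actual code: `resolve_credentials` and the three loader callables are hypothetical stand-ins for the service account, application default, and user authentication steps.

```python
from typing import Callable, Optional


def resolve_credentials(
    private_key: Optional[str],
    load_service_account: Callable[[str], object],
    load_application_default: Callable[[], Optional[object]],
    load_user_credentials: Callable[[], Optional[object]],
) -> object:
    """Return the first credentials found, trying sources in the 0.4.0 order.

    All names here are hypothetical; the loaders stand in for the real
    google-auth calls so that only the ordering is illustrated.
    """
    # 1. An explicit service account key always wins.
    if private_key is not None:
        return load_service_account(private_key)

    # 2. Application default credentials (environment variable or the
    #    service account attached to the compute environment).
    credentials = load_application_default()
    if credentials is not None:
        return credentials

    # 3. User authentication (cache lookup, then the OAuth flow).
    credentials = load_user_credentials()
    if credentials is not None:
        return credentials

    raise RuntimeError("Could not find any credentials.")


# Example: no private key and no application default credentials, so the
# user credentials are used.
found = resolve_credentials(
    private_key=None,
    load_service_account=lambda key: "service-account:" + key,
    load_application_default=lambda: None,
    load_user_credentials=lambda: "user",
)
print(found)
```

The sketch also makes the first problem listed below visible: when application default credentials exist, step 3 is never reached.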

Why does pandas-gbq do user auth at all? Aren't application default credentials enough?

It's difficult in some environments to set the right environment variables, so a way to explicitly provide credentials is desired.
BigQuery does resource-based billing, so it is possible to use user-based authentication.
- User-based authentication eliminates the unnecessary step of creating a service account.
- A user with the BigQuery User IAM role wouldn't be allowed to create a service account.
- Often datasets are shared with a specific user. Querying with user account credentials will allow them to access those shared datasets / tables.
- User-based authentication is more intuitive in shared notebook environments like Colab, where the compute credentials might be associated with a service account in a shadow project or not available at all.

Problems with the current flow

The credentials order isn't always ideal.
It's not possible to specify user credentials in environments where application default credentials are available.
If someone is familiar with the google-auth library, the current auth mechanisms do not allow passing in an arbitrary Credentials object.
It is verbose and error-prone to pass in explicit service account credentials every time. See Set project_id (and other settings) once for all subsequent queries so you don't have to pass every time #103 for a feature request for more configurable defaults.
- Error-prone? More than once, I and the other pandas-gbq contributors have forgotten to add a private_key argument to a call in a test, resulting in surprising failures in CI builds.
It's not possible to override the scopes for the credentials. For example, it is useful to add Drive / Sheets scopes for querying external data sources.

Proposal

Document default auth behavior

Current behavior (not changing, except for deprecations).

Use client if passed in.
Deprecated. Use private_key to create a Client if passed in. Use google-auth and credentials argument instead.
Attempt to create client using application default credentials. Intersphinx link to google.auth.default
Attempt to construct client using user credentials (project_id parameter must be passed in). Link to pandas_gbq.get_user_credentials().

New default auth behavior.

1b. If client not passed in, attempt to use global client at pandas_gbq.context (similar to google.cloud.bigquery.magics.context). If there is no client in the global context: run steps 2-4 and set the client it creates to the global context.
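
As a minimal sketch of step 1b, assuming a module-level singleton that is filled lazily: the `Context` class, `get_client`, and `create_default` below are hypothetical names for illustration, and `create_default` stands in for steps 2-4.

```python
from typing import Callable, Optional


class Context:
    """Hypothetical in-memory holder for a default client."""

    def __init__(self) -> None:
        self._client: Optional[object] = None

    @property
    def client(self) -> Optional[object]:
        return self._client

    @client.setter
    def client(self, value: object) -> None:
        self._client = value


# Module-level singleton, analogous to the proposed pandas_gbq.context.
context = Context()


def get_client(client: Optional[object], create_default: Callable[[], object]) -> object:
    # Step 1: an explicitly passed client bypasses everything else.
    if client is not None:
        return client
    # Step 1b: fall back to the global context, filling it on first use.
    if context.client is None:
        context.client = create_default()  # stands in for steps 2-4
    return context.client


calls = []


def make_client() -> str:
    calls.append(1)
    return "default-client"


first = get_client(None, make_client)
second = get_client(None, make_client)
# The default client is created once and reused on the second call.
print(first, second, len(calls))
```

Because the context is a plain attribute, a user could also assign to `context.client` before the first query to control the default project or location.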

Add `client` parameter to `read_gbq` and `to_gbq`

The new client parameter, if provided, would bypass all other credentials fetching mechanisms.

Why a Client and not an explicit Credentials object?

A Client contains a default project (See feature request for default projects at Set project_id (and other settings) once for all subsequent queries so you don't have to pass every time #103) and will eventually handle other defaults, such as location, encryption configuration, and maximum bytes billed.
A Client object supports more BigQuery operations than will ever be exposed by pandas-gbq (creating datasets, modifying ACLs, other property updates). Passing this in as a parameter could hint to developers that they can use the Client directly for those things.
It is clearer that the BigQuery magic command is provided by google-cloud-bigquery, not pandas-gbq.

Helpers for user-based authentication

No helpers are needed for default credentials or service account credentials because these can easily be constructed with the google-auth library. Link to samples for constructing these from the docs.

pandas_gbq.get_user_credentials(scopes=None, credentials_cache=None, client_secrets=None, use_localhost_webserver=False):

If credentials_cache is None, construct a pandas_gbq.CredentialsCache with defaults for arguments.

Attempt to load credentials from cache.

If credentials can't be loaded, start the 3-legged OAuth2 flow for installed applications. Use the provided client secrets if given, otherwise use the pandas-gbq client secrets. Use the command-line flow by default. Use a localhost webserver if use_localhost_webserver is set to True.

No credentials could be fetched? Raise an AccessDenied error. (Existing behavior of GbqConnector.get_user_account_credentials())

Save credentials to cache.

Return credentials.
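
The steps above could be sketched roughly as follows. This is not the real implementation: the OAuth flow is injected as a hypothetical `run_oauth_flow` callable, `InMemoryCache` is an illustrative stand-in for the disk-backed default cache, and credentials are modeled as plain dicts.

```python
from typing import Callable, Optional


class AccessDenied(Exception):
    """Stand-in for the AccessDenied error named in the proposal."""


class InMemoryCache:
    """Hypothetical cache exposing the load()/save() interface."""

    def __init__(self) -> None:
        self._credentials: Optional[dict] = None

    def load(self) -> Optional[dict]:
        return self._credentials

    def save(self, credentials: dict) -> None:
        self._credentials = credentials


def get_user_credentials(
    run_oauth_flow: Callable[[], Optional[dict]],
    credentials_cache: Optional[InMemoryCache] = None,
) -> dict:
    # Assumption: the proposal defaults to a disk-backed CredentialsCache;
    # an in-memory cache is used here to keep the sketch self-contained.
    if credentials_cache is None:
        credentials_cache = InMemoryCache()

    # Try the cache first, then fall back to the interactive flow.
    credentials = credentials_cache.load()
    if credentials is None:
        credentials = run_oauth_flow()

    if credentials is None:
        raise AccessDenied("Could not fetch user credentials.")

    credentials_cache.save(credentials)
    return credentials


cache = InMemoryCache()
flow_calls = []


def fake_flow() -> dict:
    flow_calls.append(1)
    return {"token": "example"}


first = get_user_credentials(fake_flow, credentials_cache=cache)
second = get_user_credentials(fake_flow, credentials_cache=cache)
# The second call is served from the cache, so the flow runs only once.
print(first == second, len(flow_calls))
```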

pandas_gbq.CredentialsCache

Constructor takes optional credentials_path.

If credentials_path not provided, set self._credentials_path to

PANDAS_GBQ_CREDENTIALS_FILE - show deprecation warning that this environment variable will be ignored at a later date.
Default user credentials path at ~/.config/pandas_gbq/bigquery_credentials.dat
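
One possible reading of this path lookup, as a sketch only: `default_credentials_path` is a hypothetical helper, and the choice of `FutureWarning` is an assumption, since the proposal only asks for a deprecation warning.

```python
import os
import warnings
from typing import Mapping


def default_credentials_path(environ: Mapping[str, str] = os.environ) -> str:
    """Hypothetical helper: pick the cache path when none is passed in."""
    path = environ.get("PANDAS_GBQ_CREDENTIALS_FILE")
    if path:
        # Assumption: FutureWarning is used as the deprecation category.
        warnings.warn(
            "PANDAS_GBQ_CREDENTIALS_FILE is deprecated and will be ignored "
            "at a later date. Pass credentials_path explicitly instead.",
            FutureWarning,
            stacklevel=2,
        )
        return path
    # Default user credentials path from the proposal.
    return os.path.join(
        os.path.expanduser("~"), ".config", "pandas_gbq", "bigquery_credentials.dat"
    )


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    overridden = default_credentials_path(
        {"PANDAS_GBQ_CREDENTIALS_FILE": "/tmp/creds.dat"}
    )

fallback = default_credentials_path({})
print(overridden, fallback)
```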

Methods

load() - load credentials from self._credentials_path, refresh them, and return them. Return None if the credentials are not found.
save(credentials) - write credentials as JSON to self._credentials_path.

pandas_gbq.WriteOnlyCredentialsCache

Same as CredentialsCache, but load() is a no-op. Equivalent to "force reauth" in current versions.

pandas_gbq.NoopCredentialsCache

Satisfies the credentials cache interface, but does nothing. Useful for shared systems where you want credentials to stay in memory (e.g. Colab).
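
To make the three cache behaviors concrete, here is a minimal sketch under simplifying assumptions: credentials are modeled as plain dicts serialized to JSON, the refresh step in load() is omitted, and the constructor takes a required path rather than resolving a default.

```python
import json
import os
import tempfile
from typing import Optional


class CredentialsCache:
    """Sketch of a disk-backed cache; credentials are plain dicts here."""

    def __init__(self, credentials_path: str) -> None:
        self._credentials_path = credentials_path

    def load(self) -> Optional[dict]:
        # Missing or unreadable files are treated as "no cached credentials".
        try:
            with open(self._credentials_path) as f:
                return json.load(f)
        except (OSError, ValueError):
            return None

    def save(self, credentials: dict) -> None:
        dirname = os.path.dirname(self._credentials_path)
        if dirname:
            os.makedirs(dirname, exist_ok=True)
        with open(self._credentials_path, "w") as f:
            json.dump(credentials, f)


class WriteOnlyCredentialsCache(CredentialsCache):
    """Never reads, so every call re-authenticates, but still saves."""

    def load(self) -> Optional[dict]:
        return None


class NoopCredentialsCache(CredentialsCache):
    """Neither reads nor writes; credentials stay in memory only."""

    def __init__(self) -> None:
        pass

    def load(self) -> Optional[dict]:
        return None

    def save(self, credentials: dict) -> None:
        pass


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pandas_gbq", "bigquery_credentials.dat")
    CredentialsCache(path).save({"refresh_token": "example"})
    cached = CredentialsCache(path).load()
    forced = WriteOnlyCredentialsCache(path).load()

noop = NoopCredentialsCache()
noop.save({"refresh_token": "example"})
print(cached, forced, noop.load())
```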

Deprecations

Some time should be given (1-year deprecation?) for folks to migrate to the new client argument. The deprecated arguments might be used in scripts and older notebooks, and are also exposed as parameters upstream in pandas.

Deprecate the PANDAS_GBQ_CREDENTIALS_FILE environment variable

Log a deprecation warning suggesting pandas_gbq.get_user_credentials with a pandas_gbq.CredentialsCache argument.

Deprecate `private_key` argument

Log a deprecation warning suggesting google.oauth2.service_account.Credentials.from_service_account_info instead of passing in bytes and google.oauth2.service_account.Credentials.from_service_account_file instead of passing in a path.

Add / link to service account examples in the docs.
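
A rough sketch of what the warning for private_key might look like. `choose_auth_source` is a hypothetical helper that only reports which source would be used, and `FutureWarning` is an assumed category; the message names the two google-auth constructors mentioned above.

```python
import warnings
from typing import Optional


def choose_auth_source(
    private_key: Optional[str] = None, credentials: Optional[object] = None
) -> str:
    """Hypothetical helper: report which auth source would be used."""
    # Explicit credentials bypass every other mechanism.
    if credentials is not None:
        return "credentials"
    if private_key is not None:
        # Assumption: FutureWarning is used as the deprecation category.
        warnings.warn(
            "private_key is deprecated. Use the credentials argument with "
            "google.oauth2.service_account.Credentials."
            "from_service_account_info (for bytes) or "
            "from_service_account_file (for a path) instead.",
            FutureWarning,
            stacklevel=2,
        )
        return "private_key"
    return "default"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    source = choose_auth_source(private_key="key.json")

print(source, len(caught))
```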

Deprecate `reauth` argument

Log a deprecation warning suggesting creating a client using credentials from pandas_gbq.get_user_credentials and a pandas_gbq.WriteOnlyCredentialsCache

Add user authentication examples in the docs.

Deprecate `auth_local_webserver` argument

Log a deprecation warning suggesting creating a client using credentials from pandas_gbq.get_user_credentials and set the auth_local_webserver argument there.

Add user authentication examples in the docs.

/cc @craigcitro @maxim-lian


tswast · 2018-08-31T18:37:27Z

#171 got me thinking. There are cases when we'll want client objects besides the google.cloud.bigquery client. In the case of #171, we'll need to construct a Storage client.

I propose that wherever I suggested a client argument in this proposal, we actually ask for credentials.

max-sixty · 2018-08-31T20:01:06Z

💯 , and then the library can manage Clients (i.e. this doesn't mean we'd need to create a new Client each request)

Trim pydata-google-auth package and add tests

This is the initial version of the proposed pydata-google-auth package (to be used by pandas-gbq and ibis). It includes two methods:

* `pydata_google_auth.default()`
  * A function that does the same as pandas-gbq does auth currently. Tries `google.auth.default()` and then falls back to user credentials.
* `pydata_google_auth.get_user_credentials()`
  * A public `get_user_credentials()` function, as proposed in googleapis/python-bigquery-pandas#161. Missing in this implementation is a more configurable way to adjust credentials caching. I currently use the `reauth` logic from pandas-gbq.

I drop `try_credentials()`, as it makes less sense when this module might be used for other APIs besides BigQuery. Plus there were problems with `try_credentials()` even for pandas-gbq (googleapis/python-bigquery-pandas#202, googleapis/python-bigquery-pandas#198).

tswast · 2018-10-26T21:43:30Z

Add pandas_gbq.get_user_credentials()

This was released as part of the pydata-google-auth package. Documented at https://pydata-google-auth.readthedocs.io/en/latest/api.html#pydata_google_auth.get_user_credentials

christianramsey · 2018-10-28T20:47:33Z

@tswast glad this was released but what does this mean for pandas-gbq? I'm still having an issue with drive scopes and was hoping this could possibly solve it. Does this solve the issue in some way?

tswast · 2018-10-29T17:03:58Z

@christianramsey I'm glad you asked. Yes, the combination of #231 and https://pydata-google-auth.readthedocs.io/en/latest/api.html#pydata_google_auth.get_user_credentials allows you to use drive scopes. I (or some helpful contributor 😃 ) need to

Release pandas-gbq 0.8.0 to get ENH: Add credentials argument to read_gbq and to_gbq. #231 into PyPI and conda.
Make some examples and put them into the pandas-gbq auth guide.
Update pandas to use pandas-gbq 0.8.0 (might happen in time for pandas 0.24, but maybe not)

A brief example of using the drive scope:

Until pandas-gbq 0.8.0 is released, install from the latest on GitHub

pip install --upgrade git+https://github.com/pydata/pandas-gbq.git

Install pydata-google-auth

pip install --upgrade pydata-google-auth

auth_example.py:

import pandas_gbq
import pydata_google_auth
import pydata_google_auth.cache

# Instead of get_user_credentials(), you could do default(), but that may not
# be able to get the right scopes if running on GCE or using credentials from
# the gcloud command-line tool.
credentials = pydata_google_auth.get_user_credentials(
    scopes=[
        'https://www.googleapis.com/auth/drive',
        'https://www.googleapis.com/auth/cloud-platform',
    ],
    # Use reauth to get new credentials if you haven't used the drive scope
    # before. You only have to do this once.
    credentials_cache=pydata_google_auth.cache.REAUTH,
    # Set auth_local_webserver to True to have a slightly more convenient
    # authorization flow. Note, this doesn't work if you're running from a
    # notebook on a remote server, such as with Google Colab.
    auth_local_webserver=True,
)

sql = """SELECT state_name
FROM `my_dataset.us_states_from_google_sheets`
WHERE post_abbr LIKE 'W%'
"""

df = pandas_gbq.read_gbq(
    sql,
    project_id='YOUR-PROJECT-ID',
    credentials=credentials,
    dialect='standard',
)

print(df)

tswast · 2018-10-29T18:11:27Z

@christianramsey Actually, you can use pydata-google-auth with pandas-gbq 0.7.0 today by using the fact that we have an in-memory cache of credentials now.

import pandas
import pandas_gbq
import pydata_google_auth
import pydata_google_auth.cache

credentials = pydata_google_auth.get_user_credentials(
    scopes=[
        'https://www.googleapis.com/auth/drive',
        'https://www.googleapis.com/auth/cloud-platform',
    ],
)

# Update the in-memory credentials cache (added in pandas-gbq 0.7.0).
pandas_gbq.context.credentials = credentials
pandas_gbq.context.project = 'your-project-id'

sql = """SELECT state_name
FROM `my_dataset.us_states_from_google_sheets`
WHERE post_abbr LIKE 'W%'
"""

df = pandas_gbq.read_gbq(
    sql,
    dialect='standard',
)

print(df)

christianramsey · 2018-10-30T13:47:03Z

The above code worked! Thank you @tswast

tswast · 2018-12-20T18:13:52Z

It appears PANDAS_GBQ_CREDENTIALS_FILE isn't actually used after #176

There is some logic that reads it, but then the value is never used.

https://github.com/pydata/pandas-gbq/blob/08590bdcb2476aa7712bcee7d13afb2dfb7ea0de/pandas_gbq/gbq.py#L316

I guess I don't have to mark it deprecated since it was broken, anyway? For users who do want similar functionality (choosing the cache location with an environment variable), pydata/pydata-google-auth#7 tracks that feature request in pydata-google-auth.

tswast added the type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. label Apr 7, 2018

tswast added this to the 0.5.0 milestone Apr 7, 2018

This was referenced Apr 7, 2018

Authenticate with credentials #156

Closed

ENH: Clear authentication defaults with more fine-grained control #129

Closed

Provide a way to override requested scopes #38

Closed

tswast self-assigned this Apr 7, 2018

tswast mentioned this issue Jun 8, 2018

Warn when using user credentials from the Cloud SDK googleapis/google-auth-library-python#266

Merged

tswast removed this from the 0.5.0 milestone Jun 25, 2018

This was referenced Jun 26, 2018

Update to_gbq and read_gbq to pandas-gbq 0.5.0 pandas-dev/pandas#21628

Merged

Add anchor links to versions in the changelog #191

Merged

tswast mentioned this issue Aug 7, 2018

Set project_id (and other settings) once for all subsequent queries so you don't have to pass every time #103

Closed

This was referenced Aug 14, 2018

Implement CredentialsCache base classes pydata/pydata-google-auth#1

Closed

User credentials for BigQuery ibis-project/ibis#1583

Closed

Do we need to SELECT 1 before each query? #198

Closed

tswast mentioned this issue Sep 7, 2018

Trim pydata-google-auth package and add tests pydata/pydata-google-auth#3

Merged

tswast mentioned this issue Sep 14, 2018

[ENH] Add CredentialsCache classes pydata/pydata-google-auth#4

Merged

tswast mentioned this issue Oct 26, 2018

ENH: Add credentials argument to read_gbq and to_gbq. #231

Merged

piotch mentioned this issue Dec 18, 2018

google big query :: scopes support ToucanToco/toucan-connectors#60

Closed

tswast mentioned this issue Dec 20, 2018

ENH: deprecate private_key argument #240

Merged

tswast mentioned this issue Dec 20, 2018

CLN: use pydata-google-auth for auth flow #241

Merged

tswast closed this as completed in #241 Jan 4, 2019
