Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(rest_api): custom client for specific resources #2082

Merged
merged 8 commits into from
Nov 27, 2024

Conversation

joscha
Copy link
Contributor

@joscha joscha commented Nov 21, 2024

Description

This adds the ability to pass a custom ClientConfig to an EndpointResource.
The passed ClientConfig object will be merged with the global one from the containing @dlt.resource.

Example

{
    "client": {
        "base_url": "https://api.mything.com",
        # Default auth is using Bearer authentication
        "auth": {
            "type": "bearer",
            "token": dlt.secrets["your_api_token"],
        }
    },
    "resources": [
        "resource-using-bearer-auth", # <-- for example this resource
        {
            "name": "my-resource-with-special-auth",
            "client": {
                # However this resource is using HTTP Basic auth instead.
                "auth": HttpBasicAuth("user", dlt.secrets["your_basic_auth_password"]) 
            },
            # ...
        }
    ]
    # ...
}

In the corresponding Slack thread a question came up whether a custom authenticator could be used. The answer is yes, it would be possible to write custom authenticator which introspects the resource requested (base on the URL for example), however the implementation of such an authenticator would be a lot more involved, less obvious than just merging a configuration object and it would also not have an answer for differing base_urls , headers or other client-specific parameters for example.

Related Issues

  • Fixes #...
  • Closes #...
  • Resolves #...

Additional Context

See originating Slack thread.

The reason for this feature is that some API endpoints, even though available under the same base URL and returning the same entities, may have different ways of accessing them. For example the Affinity V2 API uses an OpenAPI-based access model with Bearer authentication. This part of the API is available under https://api.affinity.co/v2/.
The first version of the API is available under https://api.affinity.co/ but uses HTTP Basic Auth.
Entities from both APIs are using the same entity IDs and are related, thus must be specified in the same @dlt.resource in order to define those relations.
This is currently (1.4.0) not possible, due to the differing (incompatible) authentication methods.

@@ -263,12 +263,17 @@ def create_resources(
incremental_cursor_transform,
) = setup_incremental_object(request_params, endpoint_config.get("incremental"))

merged_client_config: ClientConfig = {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

configs get merged here, with the keys defined in the endpoint taking preference over the global one.

Copy link

netlify bot commented Nov 21, 2024

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit b80734e
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/6746e64f9da64d00080590de

headers: Optional[Dict[str, str]]
auth: Optional[AuthConfig]
paginator: Optional[PaginatorConfig]
session: Optional[Session]


class ClientConfig(BaseClientConfig, total=False):
base_url: str # type: ignore[misc]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typing overrides are currently not yet supported (see python/mypy#7435) thus the ignore here.
For the optional dict in each endpoint we need to ensure that all keys are optional.

resources = c.get("resources")
if resources:
for resource in resources:
if isinstance(resource, str) or isinstance(resource, DltResource):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would go for the opposite check (is instance of EndpointResource?), is there a reason for testing like this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's a TypedDict and the check is not (yet) supported on it :-/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, you could test only for dict, but that should be enough. if's it's a dict (even if an incorrect one) the code will try to mask the client field. Then the real validation will happen.

It is not super precise (a comment to mention that the actual test should be for the proper type, but python...), but a bit simpler to read.

Happy to hear other opinions @burnash

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am happy to swap it for a check for dict if that's preferred. Whilst not amazing to read, the current type is a Union out of three possible types, so it currently is precise, but not super future proof.

@joscha joscha force-pushed the joscha/auth-overrides-per-endpoint branch from 4cbf8de to 1bf1fe7 Compare November 21, 2024 20:53
@joscha
Copy link
Contributor Author

joscha commented Nov 22, 2024

@willi-mueller @burnash would you mind taking a look and give me feedback on this, please?

@burnash burnash self-assigned this Nov 24, 2024
@burnash
Copy link
Collaborator

burnash commented Nov 24, 2024

Hi @joscha, thanks you for the contribution. I think this is a great idea. As you may have noticed, rest_api already creates a RESTClient instance for each of the resources under the hood.

One thing that I would like to discuss is that the PR extends to the resource configuration with the client key which, in turn, creates an additional way to configure the paginator. This is already configured via endpoint:

{
    "client": {
        "base_url": "https://api.mything.com",
        "auth": {
            # ...
        },
        "paginator": "json_link"
    },
    "resources": [
        "resource-using-bearer-auth",
        {
            "name": "my-resource-with-special-auth",
            "client": {
                "auth": HttpBasicAuth("user", dlt.secrets["your_basic_auth_password"]),
                "paginator": "json_link"  # <-- now it could be configured here as well
            },
            "endpoint": {
                "paginator": "json_link"
            }
        }
    ]
}

I think it would be better to keep the configuration in one place. To achieve this, I see two options:

  1. Move the paginator configuration to the resources's client from the endpoint configuration.
  2. Do not add a client configuration to resources, but extend the endpoint configuration to accept auth.

Although I think the first option is more consistent with the global client configuration, I would prefer the second option because:

  1. It is shorter for cases where the user needs to configure pagination (which happens often).
  2. It is backward-compatible.

Your example would then look like this:

{
    "client": {
        "base_url": "https://api.mything.com",
        "auth": {
            "type": "bearer",
            "token": dlt.secrets["your_api_token"],
        }
    },
    "resources": [
        "resource-using-bearer-auth",
        {
            "name": "my-resource-with-special-auth",
            "endpoint": {
                "path": "my-resource-with-special-auth",
                "auth": HttpBasicAuth("user", dlt.secrets["your_basic_auth_password"]),
                "paginator": "json_link",
            }
        }
    ]
}

What do you think?

@joscha
Copy link
Contributor Author

joscha commented Nov 24, 2024

Hi @burnash,

I initially had only auth as it was enough for what I wanted to do initially but when I opened the PR I noticed that this would not actually allow a completely reconfigured rest client. Some APIs when migrating between versions may choose to use different header parameters and/or base_urls so I thought overwriting/merging the whole client object would be best, as this would first of all work for all eventualities but also in concept be more simple than just add one separate object and then later another one, etc. whilst having to stay backwards compatible.
What do you think about removing the paginator as a possible property for overrides of the client parameters? Would that be an option? I do generally think your outlined 1. is cleaner long-term. If we were to accept both we could get the benefit of the short notation. Given the endpoint is more deeply nested I think the behavior would be that the endpoint paginator would override resource paginator would override global client paginator.
It is quite some baggage to carry though, for the convenience of omitting one extra client {}.

@francescomucio
Copy link
Contributor

For my experience, changing the paginator is something more common than changing the authentication method for different endpoints of the same API.

I am in favour of the first option, but I can see how it can cause some confusion. In my opinion we should make clear how things are handled, possible options are:

  1. the client in the resource cannot contain a client.paginator, but only an endpoint.paginator
  2. the client in the resource can contain a client.paginator, but the endpoint.paginator wins.

So it will be like

client.paginator < default_resources.client.paginator < default_resource.endpoint.paginator < resource.client.paginator < resource.endpoint.paginator

What do you think?

@joscha
Copy link
Contributor Author

joscha commented Nov 25, 2024

For my experience, changing the paginator is something more common than changing the authentication method for different endpoints of the same API.

I am in favour of the first option, but I can see how it can cause some confusion. In my opinion we should make clear how things are handled, possible options are:

  1. the client in the resource cannot contain a client.paginator, but only an endpoint.paginator
  2. the client in the resource can contain a client.paginator, but the endpoint.paginator wins.

So it will be like

client.paginator < default_resources.client.paginator < default_resource.endpoint.paginator < resource.client.paginator < resource.endpoint.paginator

What do you think?

I think either of these are okay. Personally I'd most likely change the type of the accepted client config in the resource to NOT accept a paginator (option 1).

@burnash
Copy link
Collaborator

burnash commented Nov 25, 2024

Thank you @joscha and @francescomucio for your thoughtful input and discussion on this PR. After considering all the points raised, I wanted to share my perspective.

I'm not fully convinced that adding a client configuration directly to the resource is the optimal approach. In practice, setting the paginator explicitly per endpoint is much more common than changing the authentication method within the same API. Introducing a client config at the resource level may lead to unnecessary ambiguity.

Regarding the headers, I believe these should be specified within the endpoint configuration. In fact, there's already a PR adding headers at the endpoint level #2084

As for the base_url, I'm not sure it's necessary within the resource configuration since endpoint.path can accept an absolute URL (I'd need update the documentation to reflect this). Putting headers and auth into the endpoint configuration would make it consistent with how the underlying RESTClient and RESTClient.paginate operate.

Here's my proposal for moving forward.

  1. Extend the endpoint configuration to accept auth which overrides the client's auth configuration for the endpoint.
  2. Add headers configuration to the endpoint via FEAT: parameterised headers rest_api_source #2084

Here's how the configuration would look:

{
    "client": {
        "base_url": "https://api.mything.com",
        "auth": {
            "type": "bearer",
            "token": dlt.secrets["your_api_token"],
        }
    },
    "resources": [
        "resource-using-bearer-auth",
        {
            "name": "my-resource-with-special-auth",
            "endpoint": {
                "path": "my-resource-with-special-auth",
                "auth": HttpBasicAuth("user", dlt.secrets["your_basic_auth_password"]),
                "paginator": "json_link",
                "headers": {"Custom-Header": "value"}
            }
        }
    ]
}

@joscha, would you be open to updating the PR to reflect this approach?

@joscha
Copy link
Contributor Author

joscha commented Nov 25, 2024

I can most definitely update the PR that way if that's what you wish. I think possibly spreading the individual properties of Client into the endpoint config is shortsighted as it will lead to an explosion of highly specialized properties spread into that object of whenever another is missing to support the next person's use case.
But I haven't DLT long enough to say whether that is likely or not, so given that, unless I hear back from you otherwise, I will update the PR later to use auth and headers.
I will also update the docs with the mention of the absolute URL as it is a related issue, I was not aware this was accepted before you wrote it.
Thank you!

@joscha
Copy link
Contributor Author

joscha commented Nov 25, 2024 via email

@burnash
Copy link
Collaborator

burnash commented Nov 25, 2024

Thank you @joscha for your understanding and flexibility! Let's proceed with adding just the auth configuration to the endpoint config for now. Since #2084 is already handling the headers implementation. I will help with the managing the conflicts.

@joscha joscha force-pushed the joscha/auth-overrides-per-endpoint branch from 1bf1fe7 to dff62c9 Compare November 26, 2024 00:24
@joscha
Copy link
Contributor Author

joscha commented Nov 26, 2024

changes in @burnash. Decided to do the mention of being able to use an absolute URL separately in pull request #2099.

@joscha
Copy link
Contributor Author

joscha commented Nov 26, 2024

I just saw that verified-sources seems to have a copy (?) of the rest client - see
https://github.com/dlt-hub/verified-sources/blob/3d2993a685e8c1d713a2283ac4025ca3903ed392/sources/rest_api/__init__.py#L249

  1. is that intentional? (my assumption is yes, although none of the other verified sources seem to use it)
  2. is there a process that syncs the two? If yes, where can I find it? if no, why not?
  3. if 2.) is "no", should I make the same change in this repository as well?

Copy link
Collaborator

@burnash burnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks you @joscha
Look good, there are a couple of small improvements needed, please the comments.

dlt/sources/rest_api/__init__.py Outdated Show resolved Hide resolved
dlt/sources/rest_api/__init__.py Outdated Show resolved Hide resolved
dlt/sources/rest_api/__init__.py Outdated Show resolved Hide resolved
@burnash
Copy link
Collaborator

burnash commented Nov 26, 2024

Very good question.

I just saw that verified-sources seems to have a copy (?) of the rest client - see https://github.com/dlt-hub/verified-sources/blob/3d2993a685e8c1d713a2283ac4025ca3903ed392/sources/rest_api/__init__.py#L249

  1. is that intentional? (my assumption is yes, although none of the other verified sources seem to use it)

Yes, it's intentional, we moved rest_api, sql_database and filesystem sources to dlt core in 1.0.0. Versions prior to 1.0.0 still reference these sources as being in the verified-sources repository when you execute dlt init.

  1. is there a process that syncs the two? If yes, where can I find it? if no, why not?

There isn’t a formal sync mechanism in place. If someone is working with projects that require or benefit from the latest updates or additional features available in the newer versions of these resources, I recommend upgrading to dlt version 1.0.0 or later.

  1. if 2.) is "no", should I make the same change in this repository as well?

As this is a feature PR, for now, there's no need to port these changes to verified-sources repo. For bugfixes it's different.

@joscha joscha requested a review from burnash November 26, 2024 22:39
@joscha
Copy link
Contributor Author

joscha commented Nov 27, 2024

🎉 all good to go now

@joscha joscha requested a review from burnash November 27, 2024 10:13
Copy link
Collaborator

@burnash burnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 all good to go now

Yes, looks great, thank you!

@burnash burnash merged commit aa80667 into dlt-hub:devel Nov 27, 2024
59 checks passed
@joscha
Copy link
Contributor Author

joscha commented Nov 27, 2024

Awesome, thank you. Are there instructions on how to use the current devel branch at a given sha? The only mention of nightly builds I could find is here: #997

@joscha joscha deleted the joscha/auth-overrides-per-endpoint branch November 27, 2024 12:29
@burnash
Copy link
Collaborator

burnash commented Nov 27, 2024

You can use git+ prefix in pip, for example:

Install a devel branch:

pip install "git+https://github.com/dlt-hub/dlt.git@devel"

Install a devel branch with duckdb extra

pip install "git+https://github.com/dlt-hub/dlt.git@devel#egg=dlt[duckdb]"

Replace devel with an sha to get a specific commit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants