Tech Spec for Allow checking access to destination schema in replication config UI (check-with-catalog) #20561

Closed
grishick opened this issue Dec 16, 2022 · 18 comments
Labels: area/frontend (Related to the Airbyte webapp), team/destinations (Destinations team's backlog)

Comments

@grishick
Contributor

grishick commented Dec 16, 2022

Problem Statement

Currently, we allow users to test their destination configuration only when they are setting up the destination. However, replication configuration can cause destinations to fail as well. The specific problem we have observed so far is that the Snowflake destination is configured with a default schema, and during connector setup we check for access to that schema as well as for the ability to create new schemas. If the user then sets up replication and selects "Mirror source structure" or "Namespace custom format" under the "Destination Namespace" options, the Snowflake destination can end up trying to write to a schema that it does not have access to. This has been observed in several customer accounts syncing data from a Postgres source where tables are in the "public" schema. The problem is reported as:

Schema 'PUBLIC' already exists, but current role has no privileges on it. If this is unexpected and you cannot resolve this problem, contact your system administrator. ACCOUNTADMIN role may be required to manage the privileges on the object.

Furthermore, this problem cannot currently be caught in the CHECK method, because CHECK does not have access to the source schema or the replication configuration. We already CHECK for the ability to create a new schema, so the problem occurs only when the target schema already exists and we do not have permission to write to it, and we do not know the name of the target schema until the sync starts.

Desired solution

For better UX, I suggest that we add a "check" button to the Replication page and add a new call to the protocol, which will allow connectors (or at least the destination connector) to check whether the replication configuration is correct, not just whether the connector configuration is correct.

Doing so will also require a new type of check method (checkReplication?) that takes the replication options and the source schema as arguments in addition to the destination configuration.
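
To make the shape of that more concrete, here is a minimal sketch of what such a method could compute, assuming a Python connector. Every name in it (ReplicationOptions, resolve_target_schemas, can_write, and checkReplication itself) is hypothetical; the real signature would come out of the protocol change.

```python
# Hypothetical sketch only: checkReplication is not part of the Airbyte protocol today,
# and all names below are invented for illustration.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ReplicationOptions:
    namespace_definition: str      # "destination_default" | "mirror_source" | "custom_format"
    namespace_format: str          # e.g. "${SOURCE_NAMESPACE}"
    source_namespaces: List[str]   # schemas discovered on the source, e.g. ["public"]


def resolve_target_schemas(default_schema: str, options: ReplicationOptions) -> List[str]:
    """Compute the schemas a sync with this replication config would actually write to."""
    if options.namespace_definition == "mirror_source":
        return options.source_namespaces
    if options.namespace_definition == "custom_format":
        return [options.namespace_format.replace("${SOURCE_NAMESPACE}", ns)
                for ns in options.source_namespaces]
    return [default_schema]


def check_replication(default_schema: str,
                      options: ReplicationOptions,
                      can_write: Callable[[str], bool]) -> dict:
    """Fail fast if any target schema the sync needs is not writable by the connector."""
    for schema in resolve_target_schemas(default_schema, options):
        if not can_write(schema):
            return {"status": "FAILED",
                    "message": f"Current role has no privileges on existing schema '{schema}'"}
    return {"status": "SUCCEEDED"}
```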

Current workaround

Currently, the problem is only detected after the failure of the first sync, which has a negative effect on first experience and activation.

@octavia-squidington-iv

cc @airbytehq/frontend

@evantahler
Contributor

evantahler commented Dec 16, 2022

Alternative Solution

For CHECK commands which run as part of a sync (because the previous sync attempt has failed - #13459), we provide the configured-catalog.json as an argument to the CHECK command in addition to config.json. This way, if the ConfiguredCatalog is present, the CHECK command can opt to use this additional information to do a more thorough check, knowing the schema and table names it will need.

Pros:

  • No need for a new Airbyte protocol command

Cons:

  • Changes the inputs of the CHECK command to include data that is only available sometimes
  • Users still won't see the failure until after that first failing sync (but it is assumed that the sync will fail quickly)
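
As a rough illustration of the "more thorough check when the catalog is present" idea, a Python sketch is below. The probe_schema_access callable and the simplified catalog shape are stand-ins for illustration, not the Airbyte CDK's actual types.

```python
# Sketch only: a CHECK that does extra work when a configured catalog is supplied.
# probe_schema_access and the catalog dict shape are simplified stand-ins.
from typing import Callable, Optional


def check(config: dict,
          configured_catalog: Optional[dict],
          probe_schema_access: Callable[[dict, str], bool]) -> dict:
    """Baseline CHECK, plus per-stream schema checks when a catalog is available."""
    default_schema = config.get("schema", "PUBLIC")
    if not probe_schema_access(config, default_schema):
        return {"status": "FAILED",
                "message": f"Cannot use default schema '{default_schema}'"}

    if configured_catalog is not None:
        # The catalog tells us which namespaces the sync will actually write to.
        namespaces = {s["stream"].get("namespace") or default_schema
                      for s in configured_catalog.get("streams", [])}
        for ns in namespaces:
            if not probe_schema_access(config, ns):
                return {"status": "FAILED",
                        "message": f"No write access to existing schema '{ns}'"}
    return {"status": "SUCCEEDED"}
```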

@evantahler
Contributor

evantahler commented Dec 16, 2022

As a note, I find this problem really interesting because this is the first instance I've seen where the configuration of the source can materially affect the success of the destination. I think to date we've generally assumed that Source and Destination configurations are fully independent of each other, and the sync configuration independent of both. The protocol lacks any sort of "compatibility ACK" between source, destination, and sync.

@evantahler
Contributor

Also, @grishick - asking the lazy product question: Do we see users who replicate multiple schemas, and if we do, are there regularly conflicting table names? Perhaps "mirror source" doesn't need to imply the schemas as well?

@evantahler
Contributor

Alternative Solution

... maybe in the docs we say we need an "admin"-level user who can write to all schemas.

@grishick
Contributor Author

grishick commented Dec 19, 2022

A common pattern is for a workspace to have multiple sources configured to write to a single destination with schema name specified in the replication config.

We can say "admin" level in the docs, but technically we don't need that, so I hesitate to make this a requirement.

I am OK with changing the signature of the existing CHECK method. I don't think it is very important whether we add a new checkWithCatalog method or add parameters to the existing CHECK method. I suspect that it will be easier to add a new method than to update the existing one, because:

  • fewer connectors to change
  • fewer if-else forks to add inside CHECK code paths in platform and connectors
  • maybe cleaner - that's a taste preference, so not sure about it

I want to run checkWithCatalog from the Replication Configuration screen before starting the first sync job. I think this is a better UX than finding out that your sync does not work after you've left that screen. If we add this new method (or add a new parameter to CHECK), we can also add a flag to the connector definition saying "supports check with catalog" and show this button only if the destination supports check with catalog.
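
For illustration, the gating could be as simple as the sketch below; the supportsCheckWithCatalog flag name and the definition shape are made up here, not an existing part of the connector definition.

```python
# Hypothetical: show the Replication-page "Check" button only when the destination
# definition advertises support. The supportsCheckWithCatalog flag is invented for this sketch.
def show_check_with_catalog_button(destination_definition: dict) -> bool:
    return bool(destination_definition.get("supportsCheckWithCatalog", False))


definition = {"name": "Snowflake", "supportsCheckWithCatalog": True}  # example record
assert show_check_with_catalog_button(definition)
```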

@grishick
Contributor Author

grishick commented Dec 19, 2022

Also, @grishick - asking the lazy product question: Do we see users who replicate multiple schemas
Yes (I sent some examples on Slack, but generally, yes - this is a common pattern).

and if we do, are there regularly conflicting table names? Perhaps "mirror source" doesn't need to imply the schemas as well?

Conflicting names sometimes come up on OC, but I cannot say that it's a regular occurrence. However, it is common for users to specify a target schema name in the replication configuration. If that schema already exists and the destination connector does not have permission to write to it, we will detect that only during the second attempt of the first sync.

@evantahler evantahler changed the title Allow checking access to destination schema in replication config UI Allow checking access to destination schema in replication config UI (check-with-catalog) Dec 19, 2022
@grishick
Contributor Author

grishick commented Jan 4, 2023

Also, @grishick - asking the lazy product question: Do we see users who replicate multiple schemas, and if we do, are there regularly conflicting table names? Perhaps "mirror source" doesn't need to imply the schemas as well?

Yes. We often see users replicating multiple sources into the same destination (that's kind of the point of a data warehouse). I have seen several occasions where stream names collide at destinations (especially streams like "accounts" or "devices"). Sometimes users figure it out by themselves and add prefixes, sometimes this requires handholding, and often they just give up. It would be even better if CHECK could verify not only access to the actual destination schema (not just the default one), but also check for name collisions (e.g., "hey, you already have an 'accounts_devices' table in your destination - consider using a prefix").

@etsybaev
Contributor

etsybaev commented Jan 4, 2023

It seems like #19998 is also related to this one.

@grishick
Contributor Author

grishick commented Jan 4, 2023

This small issue will make the problem less painful in the meantime: #21030

@grishick
Contributor Author

grishick commented Jan 5, 2023

Notes from grooming:
At the stage where the user is creating a connection, we don't have to run all the same checks that we do when creating a connector. If we can run a smaller set of checks at the connection creation stage, it would be better to have a different method (i.e. checkWithCatalog).

There are several trade-offs and alternatives to consider here. Someone from the destinations team will need to sit down with connector-ops and platform and write a tech spec for this change.

@evantahler
Contributor

Notes from a conversation with @grishick:

I (@evantahler) prefer modifying the existing CHECK method to take an optional second argument pair-config, e.g. docker run check --config /path/to/config.json --pair-config /path/to/source/config.json.

  • We avoid the need to get a new command working for all existing connectors and to re-publish them. With a new command, we would need to make every existing connector support it, or explicitly opt out of it
  • I think (to be validated) that all the existing CHECK commands should be OK being passed an unused optional argument. We will need to check both Java and Python connectors.

We can then use the existence (or not) of the second pair-config argument to know whether we are doing a solo CHECK or a CHECK in the context of a paired connection. This would allow us to do smart things like running the full CHECK test in solo mode, and only running the additional tests for the schema the destination will write to in pair mode (since we know we've already passed the solo checks if we have gotten to the pair check).
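
A minimal sketch of the backward-compatible argument handling this implies for a Python connector entrypoint is below; --pair-config is the flag proposed above, not an existing Airbyte CDK option, and the two check functions are placeholders.

```python
# Sketch of a connector entrypoint that tolerates an optional, possibly unused --pair-config.
# --pair-config is the flag proposed in this comment; it is not part of the current Airbyte CDK.
import argparse
import json


def parse_check_args(argv: list) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    subparsers = parser.add_subparsers(dest="command")
    check = subparsers.add_parser("check")
    check.add_argument("--config", required=True)
    check.add_argument("--pair-config", required=False, default=None)  # ignored by older connectors
    return parser.parse_args(argv)


def solo_check(config: dict) -> dict:
    return {"status": "SUCCEEDED"}             # placeholder for the existing CHECK logic


def paired_check(config: dict, pair_config: dict) -> dict:
    return {"status": "SUCCEEDED"}             # placeholder for the extra schema-level checks


def run_check(args: argparse.Namespace) -> dict:
    with open(args.config) as f:
        config = json.load(f)
    if args.pair_config is None:
        return solo_check(config)              # full standalone CHECK
    with open(args.pair_config) as f:
        pair_config = json.load(f)
    return paired_check(config, pair_config)   # only the pair-specific checks
```

With this shape, "check --config /path/to/config.json" would behave exactly as today, while adding "--pair-config /path/to/source/config.json" opts into the extra checks.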

@evantahler
Contributor

As a note, when talking about this project, it's probably important to socialize the learning that, at least for destinations, CHECK cannot be done in isolation (source information matters).

@grishick
Contributor Author

Scoping this to writing a tech spec

@grishick
Contributor Author

Removing team labels other than team/destinations from this issue to avoid polluting other teams' backlogs while @jbfbell is working on the tech spec.

@jbfbell
Contributor

jbfbell commented Jan 31, 2023

Posted the tech spec for review

@jbfbell
Contributor

jbfbell commented Feb 10, 2023

An outcome of the tech spec review was that we chose to update the existing CHECK method to allow more information to be passed in to make a more thorough check of whether the connection will be successful.

@jbfbell
Contributor

jbfbell commented Feb 16, 2023

Closing this issue as the tech spec has been completed and approved. Tickets have been created representing the milestones in the tech spec.

@jbfbell jbfbell closed this as completed Feb 16, 2023
@evantahler evantahler changed the title Allow checking access to destination schema in replication config UI (check-with-catalog) Tech Spec for Allow checking access to destination schema in replication config UI (check-with-catalog) Mar 30, 2023