Threading through capture schema names to materializations #1579

mdibaiee · 2024-08-26T15:46:11Z

mdibaiee
Aug 26, 2024
Maintainer

One of the desired features is to be able to thread through the schema name of a captured table to the materialization. This is especially useful when capturing from multiple schemas (which might also come from multiple tasks).

A strawman idea was to use the second-to-last part of the collection name as the schema name in materializations. This means if a collection is called:

estuary/my-task/my_schema.my-collection

Then my-task becomes the schema name of the materialization. To actually pass through the schema name though, we would need to capture collections with a different naming convention:

estuary/my-task/my-schema/my-collection

This would mean my-schema becomes the schema of the materialized binding. However, this creates an issue with migration of existing tasks and of updates to existing tasks. Existing tasks will have collections and materializations named without the schema as their second-to-last part. If a new binding is added to the materialization (this process automatically fills in the x-collection-name and x-schema-name of the resource binding configuration), the new binding will have the task name as its schema which is not desired.
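To make the strawman concrete, here is a minimal sketch, assuming collection names are plain slash-separated strings. `schema_from_collection` is a hypothetical helper written for illustration, not an existing connector function.

```python
def schema_from_collection(collection: str) -> str:
    """Return the second-to-last path component of a collection name."""
    parts = collection.split("/")
    if len(parts) < 2:
        raise ValueError(f"collection name has no parent component: {collection!r}")
    return parts[-2]


# Current naming: the second-to-last part is the task name.
old_style = schema_from_collection("estuary/my-task/my_schema.my-collection")

# Proposed naming: the second-to-last part is the captured schema.
new_style = schema_from_collection("estuary/my-task/my-schema/my-collection")
```

Under the current naming `old_style` is the task name, which is the migration problem described above; only the four-part name yields the captured schema.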

Moreover, we still want to support global schema configurations so users can specify a fixed schema for all collections to materialize into. With this in mind, tasks that already have a globally configured schema name will retain their behavior as-is and continue materializing into the global schema. This keeps backward compatibility and doesn't disrupt existing tasks.

One side effect is that new collections, which follow the new naming scheme, will no longer have the schema of the table as part of their table name. Previously collections would be named my-schema_my-table, but now they will be my-table. To reproduce the old behavior we would need to distinguish between collections that have 4 parts and ones that have 3, but I'm wary of introducing this logic to materializations. It seems like tech debt we would never be able to get rid of, so ideally we can accept this compromise.

This becomes a bit more challenging if the user has multiple tables with the same name but different schema names, which would now end up being materialized into the same table. The escape hatch in these instances is to switch to the new mode and specify schema names manually in resource bindings.

So materialization tasks that already exist and have a global schema configuration will continue working as they are, without the second-to-last part of the collection name becoming the schema of the table.

Materializations will be updated to have a global schema as an optional configuration (we should make sure all existing tasks have this configuration filled in), and if it is not specified, the bindings will use the second-to-last part of the collection name as the schema name of the binding. Capture connectors will then also want to create collections with the schema name separated from the collection name by a slash, instead of the current _ separator.

So capture connectors will create collections with names:

tenant/my-task/schema/table_name

And materializations use schema as the schema name in resource binding configurations. The field in the resource configuration that should be filled is to be annotated with x-schema-name: true.

The main challenge for users then is if they have existing collections and they want to use this new feature: they will need to manually specify the schema name for these older collections, and even then they will end up with table names prefixed with the captured schema name, but new collections going forward will work well with the garden path.

In summary, this is the compromise: existing collections will always have the wrong naming, so using them alongside new collections will always lead to some inconsistency on the customer's side:
1. Concretely, if they have a mix of old and new collections and they specify a global schema name, the new collections will not be prefixed with the captured schema name, while the old ones will be.
2. If no global schema is specified, the old collections will be prefixed despite being in a binding-specific schema.
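The two cases can be illustrated with a rough sketch. It assumes the table name is simply the last path component and that a configured global schema wins over everything else, as proposed in this post; `target_table` and the `analytics` schema are hypothetical names used only for illustration.

```python
from typing import Optional, Tuple


def target_table(
    collection: str,
    global_schema: Optional[str] = None,
    binding_schema: Optional[str] = None,
) -> Tuple[str, str]:
    """Return the (schema, table) pair a binding would materialize into."""
    parts = collection.split("/")
    table = parts[-1]
    # Assumed precedence: global schema, then a manually set binding schema,
    # then the second-to-last part of the collection name.
    schema = global_schema or binding_schema or parts[-2]
    return schema, table


OLD = "tenant/my-task/my-schema_my-table"  # existing three-part naming
NEW = "tenant/my-task/my-schema/my-table"  # proposed four-part naming

# Case 1: a global schema is set; only the old collection keeps the prefix.
with_global = [target_table(c, global_schema="analytics") for c in (OLD, NEW)]

# Case 2: no global schema; the old binding has its schema filled in manually.
without_global = [target_table(OLD, binding_schema="my-schema"), target_table(NEW)]
```

In both cases the old collection ends up as `my-schema_my-table` while the new one is `my-table`, which is the inconsistency being accepted.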

dyaffe · 2024-08-26T16:56:46Z

dyaffe
Aug 26, 2024
Maintainer

An alternative proposal that may be a bit simpler would be to take the my-task portion of the current name and use that as the schema name.

The rationale for this would be that it would also work for SaaS apps too. In those situations, a user would still want to likely materialize their data to a schema named "Hubspot" for instance and that data could come from the task name but not from the schema.

I suppose the counter to my counter is that we could do something more like: (1) if there is a schema, use it; (2) if there's not, use the name of the SaaS app that the data came from. That would probably end up being the most likely to align with the customer's intentions.

2 replies

mdibaiee Aug 26, 2024
Maintainer Author

In case of SaaS connectors we will continue naming collections as tenant/task/collection_name which would again have the same output as you described

dyaffe Aug 26, 2024
Maintainer

Well then I can't come up with any reason this wouldn't work, and it sounds pretty good to me, minus the knee-jerk reaction against changing the way we name things.

williamhbaker · 2024-08-28T12:42:17Z

williamhbaker
Aug 28, 2024
Maintainer

I'm a little concerned about adding additional significance to the structure of collection names, which are at some level a free-form user input, although they are generally of a static structure when captures are created through the UI.

Has there been any consideration given to making the source binding resource path directly available to consumers of the collection? Admittedly I'm not sure what the best way of accomplishing this would be, but am thinking it could possibly be added as a new field to the collection spec. The current proposal is basically encoding that information into the collection name by convention. I think the more direct approach would also have advantages for compatibility with pre-existing collections - if there's not a source resource path listed in the collection spec, continue using the "old" way, for example - and it wouldn't require additional materialization configuration options to work. It would unambiguously communicate the concept that this collection originated from something with the concept of a "schema", via the fact that the resource path contains two parts.
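As a rough sketch of what this could look like, assuming the collection spec gained an optional field carrying the source resource path: the `source_resource_path` field and the `CollectionSpec` shape below are hypothetical and do not exist in the current spec.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CollectionSpec:
    name: str
    # Hypothetical field: the resource path of the capture binding that
    # produced this collection, e.g. ["my_schema", "my_table"].
    source_resource_path: Optional[List[str]] = None


def schema_for(spec: CollectionSpec) -> Optional[str]:
    """Return the source schema, or None to signal the "old" behavior."""
    path = spec.source_resource_path
    if path is not None and len(path) == 2:
        # A two-part path means the source has a concept of a schema.
        return path[0]
    return None


with_path = CollectionSpec("acme/anything/goes", ["my_schema", "my_table"])
legacy = CollectionSpec("acme/my-task/my_schema_my_table")
```

Here the collection name carries no meaning at all: `schema_for(with_path)` yields the schema from the path, and `schema_for(legacy)` returns `None` so the materialization can keep doing what it does today.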

I've probably missed some discussion on this, so this may have already been ruled out, but wanted to ask anyway.

2 replies

mdibaiee Sep 2, 2024
Maintainer Author

@williamhbaker that is a fair point I think. One note though is that I think we would like to have task-name schema names for SaaS connectors as Dave mentioned above:

The rationale for this would be that it would also work for SaaS apps too. In those situations, a user would still want to likely materialize their data to a schema named "Hubspot" for instance and that data could come from the task name but not from the schema.

This would work automatically with the current proposal of using the convention of 2nd-to-last path component as the schema, in which case SaaS connectors will have "source-hubspot" as their schema name.

Other than this, I can't think of a strong point against your proposal.

mdibaiee Sep 2, 2024
Maintainer Author

Regarding creating a convention on collection names: I personally don't think this is an issue or a concerning limitation, as most platforms do have some form of convention for their resource names. The convention supports the garden path, but if someone wants to take a different path, they are not forced to follow the convention. They will just need to manually fill in the materialization resource binding configurations themselves with the appropriate schema and table names, so I think this is a convention of convenience.

mdibaiee · 2024-09-04T08:34:22Z

mdibaiee
Sep 4, 2024
Maintainer Author

One note about the change in semantics of global and resource-specific schemas with this change:

Until now, the global schema would be overridden by a resource-specific schema. With the new change, the global schema overrides resource-specific schemas. Users who had both global and resource-specific schemas will need to be migrated manually by filling in the global schema value on bindings which do not have a resource-specific schema, and then removing the global schema (this maintains the previous behavior for these tasks).
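A small sketch of the two precedence orders and the manual migration, with the simplifying assumption that a schema is either a string or `None`; the function names are hypothetical, and the fallback to the collection name is left out for brevity.

```python
from typing import List, Optional, Tuple


def schema_before(global_schema: Optional[str], binding_schema: Optional[str]) -> Optional[str]:
    """Old semantics: the resource-specific schema overrides the global one."""
    return binding_schema or global_schema


def schema_after(global_schema: Optional[str], binding_schema: Optional[str]) -> Optional[str]:
    """New semantics: the global schema overrides resource-specific ones."""
    return global_schema or binding_schema


def migrate(
    global_schema: str, binding_schemas: List[Optional[str]]
) -> Tuple[None, List[str]]:
    """Pin the global value onto bindings without a schema, then drop the global."""
    return None, [b or global_schema for b in binding_schemas]


bindings = [None, "sales"]
before = [schema_before("public", b) for b in bindings]

new_global, new_bindings = migrate("public", bindings)
after = [schema_after(new_global, b) for b in new_bindings]
```

Without the migration, `schema_after("public", "sales")` would resolve to `public` and silently move that binding, which is why tasks using both settings need the manual step.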

0 replies

williamhbaker · 2024-09-04T15:38:56Z

williamhbaker
Sep 4, 2024
Maintainer

Yeah I am generally also concerned about how this would change the meaning of materialization configurations.

Materializations like Snowflake and BigQuery already require a "schema" input in their endpoint configuration to work (BigQuery uses datasets as the equivalent concept to schemas). They need this to know where to put the metadata tables. This is always going to be present for them, and that seems like it would be problematic. Would we need a second optional "default schema" input?

1 reply

mdibaiee Sep 4, 2024
Maintainer Author

That's a good point, one potential path out of this is:

Keep semantics of global and binding-specific schema the same as it is: the global config serves as a fallback/default, and binding-specific one overrides it. During migration, for all existing tasks add the schema manually to all the bindings (to avoid them being automatically replaced by the collection's 2nd-to-last component), and keep the global schema as the schema of the metadata tables.

The challenge with this approach is any new bindings added to these tasks will pick up the schema from the 2nd-to-last part of the collection name, and if that's not desired, the user needs to update them. This would have been a problem for customers who had resource-specific schema configurations previously as well, so I think it's not a new problem, it's just applicable to more cases.

mdibaiee · 2024-09-04T16:40:49Z

mdibaiee
Sep 4, 2024
Maintainer Author

Had a discussion with the team; here is where we have arrived:

We will have x-schema-name and x-delta-updates annotations in materialization resource bindings. This change is not blocked by anything and can go ahead right now.
We will have a new variant of sourceCapture where the user gets to specify how schemas and delta updates should be handled when adding new bindings: should the materialization schema name be populated from the collection name's 2nd-to-last part, or should it be left blank? Should new bindings be added as delta updates?
In the UI we will have an option when adding bindings to a materialization to specify whether the new bindings will have their schema filled by the collection's 2nd-to-last part, be left blank, or be populated by a user-provided custom value.
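One way these options could fit together is sketched below. Every name here (`SchemaMode`, `SourceCaptureOptions`, the resource keys) is hypothetical, since the actual shape of the new sourceCapture variant has not been decided in this thread, and folding the UI-only custom value into the same structure is a simplification.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SchemaMode(Enum):
    FROM_COLLECTION = "fromCollection"  # use the 2nd-to-last name part
    LEAVE_BLANK = "leaveBlank"          # do not fill the schema field
    CUSTOM = "custom"                   # user-provided value (UI option)


@dataclass
class SourceCaptureOptions:
    capture: str
    schema_mode: SchemaMode = SchemaMode.LEAVE_BLANK
    custom_schema: Optional[str] = None
    delta_updates: bool = False


def new_binding_resource(collection: str, opts: SourceCaptureOptions) -> dict:
    """Build the resource config for a newly added binding."""
    parts = collection.split("/")
    resource = {"table": parts[-1], "delta_updates": opts.delta_updates}
    if opts.schema_mode is SchemaMode.FROM_COLLECTION:
        resource["schema"] = parts[-2]
    elif opts.schema_mode is SchemaMode.CUSTOM:
        resource["schema"] = opts.custom_schema
    return resource


opts = SourceCaptureOptions("tenant/my-task", SchemaMode.FROM_COLLECTION)
resource = new_binding_resource("tenant/my-task/schema/table_name", opts)
```

Defaulting to `LEAVE_BLANK` in this sketch keeps existing tasks unchanged unless the user opts in.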

0 replies
