-
An alternative proposal that may be a bit simpler would be to take the […]. The rationale is that it would also work for SaaS apps. In those situations, a user would still likely want to materialize their data to a schema named "Hubspot", for instance, and that name could come from the task name but not from the schema.

I suppose the counter to my counter is that we could do something more like: (1) if there is a schema, use it; (2) if there is not, use the name of the SaaS app that the data came from. That would probably be the option most likely to align with the customer's intentions.
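The two-step fallback described here could be sketched roughly as follows. This is only an illustration: the function name `default_schema` and both parameter names are made up for the example and are not part of any existing connector API.

```python
from typing import Optional


def default_schema(source_schema: Optional[str], source_name: str) -> str:
    """Pick a default destination schema for a binding (hypothetical helper)."""
    # (1) If the captured resource had a schema, use it.
    if source_schema:
        return source_schema
    # (2) Otherwise fall back to the name of the SaaS app the data came from.
    return source_name


print(default_schema("public", "my-postgres-task"))
print(default_schema(None, "Hubspot"))
```

A database capture would take branch (1), while a SaaS capture without any notion of schemas would take branch (2).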
-
I'm a little concerned about adding additional significance to the structure of collection names, which are at some level free-form user input, although they generally have a static structure when captures are created through the UI.

Has there been any consideration given to making the source binding resource path directly available to consumers of the collection? Admittedly I'm not sure what the best way of accomplishing this would be, but I'm thinking it could possibly be added as a new field in the collection spec. The current proposal basically encodes that information into the collection name by convention.

I think the more direct approach would also have advantages for compatibility with pre-existing collections (if there's no source resource path listed in the collection spec, continue using the "old" way, for example), and it wouldn't require additional materialization configuration options to work. It would unambiguously communicate that this collection originated from something with the concept of a "schema", via the fact that the resource path contains two parts.

I've probably missed some discussion on this, so this may have already been ruled out, but I wanted to ask anyway.
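To make the idea a bit more concrete, here is a minimal sketch of how a consumer might use such a field. The field name `sourceResourcePath` and the `legacy_schema` parameter are assumptions invented for this example; no such field exists in the collection spec today.

```python
from typing import Optional, Sequence


def schema_from_resource_path(resource_path: Optional[Sequence[str]]) -> Optional[str]:
    """Return the source schema if the resource path has two parts, else None."""
    # A two-part path such as ["my-schema", "my-table"] signals that the source
    # has a concept of "schema"; anything else does not.
    if resource_path is not None and len(resource_path) == 2:
        return resource_path[0]
    return None


def binding_schema(collection_spec: dict, legacy_schema: str) -> str:
    """Prefer the recorded source schema; otherwise keep the "old" behavior."""
    # "sourceResourcePath" is a hypothetical field name used only for illustration.
    schema = schema_from_resource_path(collection_spec.get("sourceResourcePath"))
    return schema if schema is not None else legacy_schema


new_spec = {"sourceResourcePath": ["my-schema", "my-table"]}
old_spec = {}  # pre-existing collection without the hypothetical field
print(binding_schema(new_spec, "fallback"))
print(binding_schema(old_spec, "fallback"))
```

Pre-existing collections simply lack the field, so they fall through to whatever the materialization did before.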
-
One note about the change in semantics of global and resource-specific schemas with this change: until now, the global schema would be overridden by a resource-specific schema. With the new change, the global schema overrides resource-specific schemas. Users who had both global and resource-specific schemas will need to be migrated manually by filling in the global schema value on bindings that do not have a resource-specific schema, and then removing the global schema (this maintains the previous behavior for these tasks).
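The precedence flip and the manual migration can be sketched as below. The data model (a global schema plus a mapping from binding name to an optional resource-specific schema) is a simplification assumed for this example, not the actual task spec format.

```python
from typing import Dict, Optional, Tuple

# binding name -> resource-specific schema (None when the binding sets none)
Bindings = Dict[str, Optional[str]]


def effective_old(global_schema: Optional[str], bindings: Bindings) -> Bindings:
    """Previous semantics: a resource-specific schema overrides the global one."""
    return {name: (s if s else global_schema) for name, s in bindings.items()}


def effective_new(global_schema: Optional[str], bindings: Bindings) -> Bindings:
    """New semantics: the global schema, when set, overrides resource-specific ones."""
    return {name: (global_schema if global_schema else s) for name, s in bindings.items()}


def migrate(global_schema: Optional[str], bindings: Bindings) -> Tuple[Optional[str], Bindings]:
    """Fill bindings lacking a schema with the global value, then drop the global."""
    filled = {name: (s if s else global_schema) for name, s in bindings.items()}
    return None, filled


bindings = {"orders": None, "customers": "sales"}
before = effective_old("analytics", bindings)
new_global, new_bindings = migrate("analytics", bindings)
after = effective_new(new_global, new_bindings)
print(before == after)
```

Without the migration step, `effective_new("analytics", bindings)` would place both bindings in `analytics`, which is the behavior change being flagged.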
-
Yeah, I am generally also concerned about how this would change the meaning of materialization configurations. Materializations like Snowflake and BigQuery already require a "schema" input in their endpoint configuration to work (BigQuery uses datasets as the equivalent concept to schemas). They need this to know where to put the metadata tables. This is always going to be present for them, and that seems like it would be problematic. Would we need a second, optional "default schema" input?
-
Had a discussion with the team; here is where we have arrived now:
-
One of the desired features is to be able to thread the schema name of a captured table through to the materialization. This is especially useful when capturing from multiple schemas (which might also come from multiple tasks).
A strawman idea was to use the second-to-last part of the collection name as the schema name in materializations. This means if a collection is called:
then `my-task` becomes the schema name of the materialization. To actually pass through the schema name, though, we would need to capture collections with a different naming convention:
This would mean `my-schema` becomes the schema of the materialized binding.

However, this creates an issue with the migration of existing tasks and with updates to existing tasks. Existing tasks will have collections and materializations named without the schema as their second-to-last part. If a new binding is added to the materialization (this process automatically fills in the `x-collection-name` and `x-schema-name` fields of the resource binding configuration), the new binding will have the task name as its schema, which is not desired.

Moreover, we will want to somehow still support global schema configurations so users can specify a fixed schema for all collections to materialize into. With this in mind, we can say that tasks that already have a globally configured schema name will retain their behavior as-is and continue materializing into the global schema. This keeps backward compatibility and doesn't disrupt existing tasks.

One side effect of this will be that new collections, which have a different naming scheme, will not have the schema of the table as part of their table name anymore. Previously collections would be named with `my-schema_my-table`, but now they will be `my-table`. To reproduce the old behavior we would need to distinguish between collections that have 4 parts and the ones that have 3, but I'm wary of introducing this logic to materializations; it seems like tech debt we will never be able to get rid of, so ideally we can accept this compromise. This becomes a bit more challenging if the user has multiple tables with the same name but different schema names, which would now end up being materialized into the same table. The escape hatch in these instances is to switch to the new mode and specify schema names manually in resource bindings.

So materialization tasks that already exist and have a global schema configuration will continue working as they are, without the second-to-last part of the collection name becoming the schema of the table.
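The same-table-name problem under a global schema could be detected ahead of time with something like the sketch below. The collection names (including the `acmeCo` prefix and the `schema-a`/`schema-b` parts) are invented for illustration, and taking the last name part as the table name is an assumption about how the table name is derived.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def table_name(collection: str) -> str:
    """Assume the table name is the last slash-separated part of the collection name."""
    return collection.split("/")[-1]


def find_table_collisions(collections: Iterable[str]) -> Dict[str, List[str]]:
    """Group collections by table name and keep only names claimed more than once."""
    by_table: Dict[str, List[str]] = defaultdict(list)
    for collection in collections:
        by_table[table_name(collection)].append(collection)
    return {table: names for table, names in by_table.items() if len(names) > 1}


# Hypothetical new-style collection names, all bound to one global schema.
collections = [
    "acmeCo/my-task/schema-a/users",
    "acmeCo/my-task/schema-b/users",
    "acmeCo/my-task/schema-a/orders",
]
print(find_table_collisions(collections))
```

Any non-empty result would be a signal that the task needs the escape hatch: no global schema, with schema names set per binding.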
Materializations will be updated to have a global `schema` as an optional configuration (we should make sure all existing tasks have this configuration filled in), and if it is not specified, the bindings will use the second-to-last part of the collection name as the schema name of the binding. Capture connectors will then also want to create collections with the schema name separated from the table name by a slash, instead of the current `_` splitting.

So capture connectors will create collections with names:
And materializations use `schema` as the schema name in resource binding configurations. The field in the resource configuration that should be filled is to be annotated with `x-schema-name: true`.

The main challenge for users then is if they have existing collections and want to use this new feature: they will need to manually specify the schema name for these older collections, and even then they will end up with table names prefixed with the captured schema name. New collections going forward will work well with the garden path.
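As a rough sketch of the annotation-driven auto-fill, the snippet below walks a resource config JSON schema and fills annotated fields from the collection name. The schema contents, the field names `table` and `schema`, the `acmeCo` prefix, and the choice to use the last name part for `x-collection-name` are all assumptions made for the example; only the `x-collection-name` and `x-schema-name` annotations themselves come from the proposal.

```python
from typing import Any, Dict

# Hypothetical resource config schema for a materialization binding.
RESOURCE_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "table": {"type": "string", "x-collection-name": True},
        "schema": {"type": "string", "x-schema-name": True},
        "delta_updates": {"type": "boolean"},
    },
}


def default_resource_config(collection: str, resource_schema: Dict[str, Any]) -> Dict[str, str]:
    """Pre-fill annotated resource config fields from a collection name."""
    parts = collection.split("/")
    config: Dict[str, str] = {}
    for field, spec in resource_schema.get("properties", {}).items():
        if spec.get("x-collection-name"):
            # Assumption: the last part of the collection name becomes the table name.
            config[field] = parts[-1]
        elif spec.get("x-schema-name") and len(parts) >= 2:
            # The second-to-last part becomes the schema name, per the proposal.
            config[field] = parts[-2]
    return config


print(default_resource_config("acmeCo/my-task/my-schema/my-table", RESOURCE_SCHEMA))
```

Fields without either annotation (such as `delta_updates` here) are left for the user to set.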
In summary, this is the compromise: existing collections will always have the wrong naming, so using them alongside new collections will always lead to some inconsistency on the customer's side:
1. Concretely, if they have a mix of old and new collections and they specify a global schema name, the new collections will not be prefixed with the captured schema name, while the old ones will be.
2. If no global schema is specified, the old collections will be prefixed despite being in a binding-specific schema.
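The two outcomes above can be reproduced with a small sketch of the resolution rule. The collection names (with an invented `acmeCo` prefix), the `analytics` schema, and the function signature are assumptions for illustration only.

```python
from typing import Optional, Tuple


def resolve(
    collection: str,
    global_schema: Optional[str],
    binding_schema: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (schema, table) for a binding under the proposed rules."""
    parts = collection.split("/")
    table = parts[-1]
    if global_schema:
        # A global schema, when set, wins for every binding.
        return global_schema, table
    # Otherwise use a manually specified binding schema, or default to the
    # second-to-last part of the collection name.
    return (binding_schema if binding_schema else parts[-2]), table


OLD = "acmeCo/my-task/my-schema_my-table"  # hypothetical old-style name (3 parts)
NEW = "acmeCo/my-task/my-schema/my-table"  # hypothetical new-style name (4 parts)

# Case 1: global schema set; only the old collection keeps the prefix.
print(resolve(OLD, "analytics"), resolve(NEW, "analytics"))
# Case 2: no global schema; the old collection needs a manual schema and is still prefixed.
print(resolve(OLD, None, binding_schema="my-schema"), resolve(NEW, None))
```

In case 2 the old collection lands in `my-schema` but keeps the `my-schema_` table prefix, which is the inconsistency the compromise accepts.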