
Bug: Schema conforming should not auto-add underscores to table and column names #1205

Closed
visch opened this issue Nov 19, 2022 · 16 comments · Fixed by #1342


visch commented Nov 19, 2022

Singer SDK Version

0.14.0

Python Version

3.8

Bug scope

Targets (data type handling, batching, SQL object generation, etc.)

Operating System

Ubuntu

Description

Code and result are in a gist, since the schema example is too large to include here: https://gist.github.com/visch/e1c2f8c17d6d0e09e40c80964eb7c51b

Not sure exactly what's going on here, as I haven't dug in yet.

Code

No response

Other places this has come up

This also came up here: https://meltano.slack.com/archives/C01TCRBBJD7/p1670275818704839?thread_ts=1670274133.009739&cid=C01TCRBBJD7

visch added the kind/Bug and valuestream/SDK labels on Nov 19, 2022

visch commented Dec 5, 2022

I don't love the conforming behavior, so I bypassed it: https://github.com/MeltanoLabs/target-postgres/pull/35/files#diff-faad182dd89cef8b5eca1795448fb031e26b9e964a49e282e9ebffa575b87f21R216-R218

Not the best solution, but much better than this bug.


radbrt commented Dec 7, 2022

This one really needs to be fixed. It should probably be a fix to the snakecase function, but I'm inclined to create an intermediate fix inside the conform_name function while we debate camel case etc. The minimal solution to avoid the most common and absurd results would be something like this:

    # Assumes `re` is imported and `snakecase` is the existing SDK helper.
    # Only lowercase names that don't already start with lowercase letters,
    # so all-caps names like "EC2" become "ec2" before snake-casing.
    if not re.match('^[a-z]+', name):
        name = name.lower()

    name = snakecase(name)
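
For concreteness, here is a self-contained sketch of that guard. The naive_snakecase stand-in below is not the SDK's actual implementation; it just mimics the failure mode reported in this issue (an underscore injected before each capital):

    import re

    def naive_snakecase(name):
        # Stand-in for the buggy behavior: "_" before every uppercase letter.
        return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

    def conform_name(name):
        # The intermediate fix: lowercase names that don't start with a
        # lowercase letter before snake-casing.
        if not re.match('^[a-z]+', name):
            name = name.lower()
        return naive_snakecase(name)

    print(naive_snakecase('EC2'))  # e_c2 (the bug)
    print(conform_name('EC2'))     # ec2  (with the guard)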

@tayloramurphy

@kgpayne seems like this one might be good to pick up as part of the tap/target snowflake work. cc @aaronsteers

@aaronsteers

Agreed.

Calling out from the link, it looks like all-caps words are just broken. I don't know if we can rely on snakecase() to work properly and in a stable manner. And this is not an area we can 'fix' lightly once it is pervasive: 'fixing' 'EC2' to 'e_c_2' or 'ec_2' or 'ec2' will cause breakage in users' pipelines, since any change at all between syncs is a breaking change.

I would suggest removing the snake case operation and keeping the other conversions of illegal characters. Optionally, now or in the future, we could make this a (probably user-specified) setting for targets.
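
For illustration, a rough sketch of that conservative option, keeping only illegal-character replacement (the character class and leading-digit rule here are assumptions, not the SDK's actual behavior):

    import re

    def conform_conservatively(name):
        # Replace anything illegal in an unquoted SQL identifier with "_".
        name = re.sub(r'[^A-Za-z0-9_]', '_', name)
        # Identifiers generally can't start with a digit.
        if re.match(r'^\d', name):
            name = '_' + name
        return name

    print(conform_conservatively('order-id'))  # order_id
    print(conform_conservatively('EC2'))       # EC2 (untouched: stability first)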

One inherent challenge here is that whether to coerce to snakecase will depend not so much on the target dev's preferences as on the nature of the tap the target is getting data from. E.g. it's definitely not needed for tap-snowflake (which would output either all-upper or all-lower names), but probably good to have for tap-salesforce, whose naming is almost entirely camelcase.

Even if helpful, since naming stability is one of our primary goals here, it might make sense to keep it out for now, until we have very high confidence that the name translation is stable over a large variety of edge cases.

@aaronsteers

@radbrt - Is this something you might be interested to contribute?


radbrt commented Dec 9, 2022

@aaronsteers: I can take a closer look at the snakecase function and write down some options. The breaking-changes issue makes this kinda tricky: we want to fix it as soon as possible and do it correctly the first time around, but we could also debate this indefinitely. Mostly I believe that lowercasing all-caps columns would not break any future improvements, but I guess it's hard to guarantee.


aaronsteers commented Dec 9, 2022

Mostly I believe that lowercasing all-caps columns would not break any future improvements

Totally with you on this. Lowercasing and removal of special characters both seem sound and stable. We thought snake casing would also be stable (thinking of SalesId columns converting neatly to sales_id), but now we're seeing a ton of edge cases, and I don't know that injecting an underscore is actually worth the potential readability improvements. (I've dealt with columns landing with names like customeraddress2 - it's a bit annoying, but it's also easy enough to mitigate in dbt as you build models that alias those columns according to your preferred naming conventions.)

@visch
Copy link
Contributor Author

visch commented Dec 9, 2022

Mostly I believe that lowercasing all-caps columns would not break any future improvements

Totally with you on this. Lowercasing and removal of special characters both seem sound and stable. We thought snake casing would also be stable (thinking of SalesId columns converting neatly to sales_id), but now we're seeing a ton of edge cases, and I don't know that injecting an underscore is actually worth the potential readability improvements. (I've dealt with columns landing with names like customeraddress2 - it's a bit annoying, but it's also easy enough to mitigate in dbt as you build models that alias those columns according to your preferred naming conventions.)

Lowercasing when pulling from another database like Oracle doesn't make a lot of sense to me. For non-database sources it seems OK, but my gut says leave the data alone; we're asking for trouble by mapping names. Does renaming just belong in mappers? It is a mapping of source data we are doing here:

  1. Collisions are harder to reason about.
  2. What if we hit a source that legitimately has two column names that are the same except for capitalization? Eeek.

I vote for minimal transformations, like replacing invalid characters such as dashes in schema names with underscores.

Maybe we put the extra transforms behind a flag?


radbrt commented Dec 9, 2022

As always, Stack Overflow has a good suggestion:

    def snakecase(self, name):
        # Split before a Capitalized word that follows any character,
        # e.g. "HTTPResponse" -> "HTTP_Response" (requires `import re`
        # at module level; `self` because this lives on SQLSink).
        name = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
        # Split between a lowercase letter or digit and an uppercase letter,
        # e.g. "getHTTP" -> "get_HTTP".
        name = re.sub('([a-z0-9])([A-Z])', r'\1_\2', name)
        return name.lower()

Some example output:

    print(conform_name('camel2_camel2_case'))  # camel2_camel2_case
    print(conform_name('getHTTPResponseCode'))  # get_http_response_code
    print(conform_name('HTTPResponseCodeXYZ'))  # http_response_code_xyz
    print(conform_name('ABCTradingPartnersLLC'))  # abc_trading_partners_llc
    print(conform_name('ABCTradingPartnersLLC_q1_results_24!'))  # abc_trading_partners_llc_q1_results_24
    print(conform_name("ONLYCAPITALLETTERS1_XYZ"))  # onlycapitalletters1_xyz

One less ideal output:

    print(conform_name('ABCTradingPartnersLLCq1Results_24!'))  # abc_trading_partners_ll_cq1_results_24

Some more testing might be in order, but so far I have thrown the new snakecase function into the SQLSink class and it seems to work fine.
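
To make that testing concrete, here is a sketch of parametrized tests (the import path is hypothetical). They also pin the stability property discussed above: conforming an already-conformed name must be a no-op, since any change between syncs is a breaking change.

    import pytest
    from my_target.sinks import conform_name  # hypothetical import path

    CASES = [
        ('camel2_camel2_case', 'camel2_camel2_case'),
        ('getHTTPResponseCode', 'get_http_response_code'),
        ('HTTPResponseCodeXYZ', 'http_response_code_xyz'),
        ('ABCTradingPartnersLLC', 'abc_trading_partners_llc'),
    ]

    @pytest.mark.parametrize('raw,expected', CASES)
    def test_conform_name(raw, expected):
        assert conform_name(raw) == expected
        # Stability: a conformed name must survive re-conforming unchanged.
        assert conform_name(expected) == expected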

This is most definitely changing column names more than technically necessary, so if it is a contentious issue I'm OK with a switch like @visch suggests. In that case we'd also need a function that replaces just the disallowed characters of a column name.

Any thoughts @aaronsteers @kgpayne @tayloramurphy ?


kgpayne commented Dec 15, 2022

@radbrt I really like your improved snake case method 👏 I also think it's worth revisiting the motivation for conform_name in the first place, to possibly reevaluate the SDK defaults.

The main driver is that stream property identifiers are much more permissive than database object identifiers (database, schema, table, and column names). E.g.:

"record": {
  "select": "all",
  "Select": "also all",
  "I can be anything 🤦‍♂️": "doh",
  "MoreRealisticMixedCase": "this comes up a lot",
}

These are all valid record properties, but for many SQL targets they are either invalid or require double quotes (which is often onerous on downstream consumers).
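
As a quick illustration of the mismatch, here is a sketch of which of those keys would survive as unquoted identifiers (the regex approximates the common unquoted-identifier rule, and the reserved-word list is a tiny stand-in):

    import re

    RESERVED = {'select'}  # stand-in; real targets have long reserved lists

    def is_valid_unquoted(name):
        return (
            bool(re.fullmatch(r'[A-Za-z_][A-Za-z0-9_]*', name))
            and name.lower() not in RESERVED
        )

    for key in ['select', 'Select', 'I can be anything 🤦‍♂️', 'MoreRealisticMixedCase']:
        print(key, is_valid_unquoted(key))
    # Only MoreRealisticMixedCase passes; "select"/"Select" are reserved,
    # and the emoji key needs quoting everywhere.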

So our initial implementation aimed to:

  • Provide developers with a method to overload with conform logic applicable to their target system.
  • Provide reasonable defaults that would cover most systems, to reduce the amount of dev leg-work involved in writing a target.

As @aaronsteers says, we hoped snake case would readily satisfy that second objective, though in hindsight it may have been less conservative than strictly necessary. Snake case is as much a 'quality of life' improvement (esp. for those using dbt, as noted above) as it is generally compliant with most common SQL targets.

With that in mind, I am actually happy with either the "strip illegal chars and reserved names, then .lower()" route or the snakecase route, leaving the ultimate choice to target developers 😅 My personal preference would be (a working!) snakecase implementation, but it would also be valid to include your improved implementation as a helper without enabling it by default.

Maybe @tayloramurphy and @pnadolny13 might have opinions from a 'data consumers' perspective, or can help us reach consensus on what a "sensible default" might be here 👍


kgpayne commented Dec 15, 2022

For comparison, pipelinewise-target-snowflake does this. It isn't the easiest to follow, due to the convolution of flattening and transformation, but it looks like they strip illegal chars with a re.sub(...) and apply .lower() as part of their key construction.


kgpayne commented Dec 15, 2022

@visch I understand your hesitance to transform names, but I also think pushing name transformations into Mappers generates extra steps and work for project maintainers, unless we find a way to distribute mappers as 'sensible defaults' (i.e. "if your target is a SQL type, we recommend applying this mapper to conform column names"). For non-SDK taps, we would also necessitate the use of meltano-map-transform as an extra step in their EL pipelines.

This does make me think that, from a metadata perspective, we ought to provide a mechanism for future lineage tooling to discover the upstream, untransformed name. I.e. if the tap catalog contains HTTPResponseCodeXYZ but the target decides to use http_response_code_xyz, the before and after values should ideally appear as a map in an _sdc_property_transforms table in the destination database 🤔 This would theoretically allow a metadata system to traverse further back in the data lineage than just the source tables of the target database. It would also allow us to be more liberal with column name transformations, knowing that the original values are just a join away 😅
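
To sketch what that might look like (the names and structure here are purely illustrative, not an agreed SDK design):

    def conform_with_audit(properties, conform):
        # Conform each property name, collecting (original, conformed)
        # pairs that a target could write to _sdc_property_transforms.
        conformed, transforms = {}, []
        for original, prop in properties.items():
            new = conform(original)
            conformed[new] = prop
            if new != original:
                transforms.append(
                    {'original_name': original, 'conformed_name': new}
                )
        return conformed, transforms

    props, audit = conform_with_audit(
        {'HTTPResponseCodeXYZ': {'type': 'integer'}}, str.lower  # stand-in transform
    )
    print(audit)  # [{'original_name': 'HTTPResponseCodeXYZ', 'conformed_name': 'httpresponsecodexyz'}]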


visch commented Dec 15, 2022

@visch I understand your hesitance to transform names, but I also think pushing name transformations into Mappers generates extra steps and work for project maintainers, unless we find a way to distribute mappers as 'sensible defaults' (i.e. "if your target is a SQL type, we recommend applying this mapper to conform column names"). For non-SDK taps, we would also necessitate the use of meltano-map-transform as an extra step in their EL pipelines.

Zero conforming isn't the right answer either. Snake case seems like too much to me, unless we're also subscribing folks to some kind of "Meltano style guide". Even in that case we could default it on, but it should be something you can deactivate. Most renaming today happens in dbt, so it's pretty normal for folks, I think (mappers do seem like a stretch now that you say it).

We should be keeping conforming to a minimum imo 🤷; bugs like this issue come in, and there's a plethora of edge cases we haven't started to test/hit yet. Less is more here, I think.

This does make me think that, from a metadata perspective, we ought to provide a mechanism for future lineage tooling to discover the upstream, untransformed name. I.e. if the tap catalog contains HTTPResponseCodeXYZ but the target decides to use http_response_code_xyz, the before and after values should ideally appear as a map in an _sdc_property_transforms table in the destination database 🤔 This would theoretically allow a metadata system to traverse further back in the data lineage than just the source tables of the target database. It would also allow us to be more liberal with column name transformations, knowing that the original values are just a join away 😅

Add it to the catalog? Generally dbt seems like the right place for most of this. Since the target doesn't control the catalog, a metadata table acting as a "target catalog", like you say, makes sense to me.


pnadolny13 commented Dec 15, 2022

@kgpayne my thoughts:

  • I think the target should transform the field names so the target system can accept them, though I agree snake case seems like too much as a default.
  • Requiring users to figure out mappers for this is setting them up for failure. It's hard to do/test, especially for new users.
  • Although, I agree with Derek that minimal changes are ideal.
  • From my narrow view it seems like this is very target specific, so providing the developer with several common helpers and a no-op method as a placeholder for where to put one of those helpers seems like a good solution. If they don't like the helpers, they can implement something custom. 🤔 Although I see the other side of it, where having a single Meltano way of doing it would be clearer: one set of docs, clearly communicated, and clear expectations for new targets.
  • Another useful thing would be transparency, e.g. if these formatting rules were added to the --about README, or even suggesting the dev leave comments in the code. For me, seeing something like re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name) doesn't tell me much 😅

What if we hit a source that legitimately has two column names that are the same except for capitalization? Eeek.

For @visch's edge case: couldn't we add logic to check for things like this in the post-transformed field names? Duplicates should raise an error, and then the user can fall back to mappers or raise an issue for the target to support their edge case.
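
A minimal sketch of that duplicate check (conform_name stands in for whatever transform the target applies):

    from collections import defaultdict

    def check_collisions(names, conform_name):
        seen = defaultdict(list)
        for name in names:
            seen[conform_name(name)].append(name)
        collisions = {k: v for k, v in seen.items() if len(v) > 1}
        if collisions:
            # Two distinct source names collapsed to one conformed name.
            raise ValueError(f'Name collisions after conforming: {collisions}')

    check_collisions(['order_id', 'orderId'], str.lower)     # ok
    # check_collisions(['OrderID', 'orderid'], str.lower)    # raises ValueError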


aaronsteers commented Dec 15, 2022

There are probably a lot of good ways to think about name standardization and the right level of control, visibility, customization, and stability.

As an immediate action though, I think we should probably just disable the snake casing altogether. We can let target developers handle this if they want to override the base implementation.

This also reduces scope so we can move forward with the fix asap.

Thoughts?

@edgarrmondragon

As an immediate action though, I think we should probably just disable the snake casing altogether. We can let target developers handle this if they want to override the base implementation.

@aaronsteers I agree. It's causing more problems than it solves, apparently.

aaronsteers changed the title from "[Bug]: Schema conforming adding underscores to column names" to "Bug: Schema conforming should not auto-add underscores to table and column names" on Jan 20, 2023