Bug: Schema conforming should not auto-add underscores to table and column names #1205
Comments
I don't love the conforming stuff so I bypassed it https://github.com/MeltanoLabs/target-postgres/pull/35/files#diff-faad182dd89cef8b5eca1795448fb031e26b9e964a49e282e9ebffa575b87f21R216-R218 Not the best solution but much better than this bug |
This one really needs to be fixed, and it should probably be a fix to the
|
@kgpayne seems like this one might be good to pick up as part of the tap/target snowflake work. cc @aaronsteers |
Agreed. Calling out from the link, it looks like all-caps words are just broken. I don't know if we can rely on I would suggest removing the snake case operation and keeping the other conversions of illegal characters. Optionally, now or in future, we could make this a (probably user-specified) setting for targets. One inherent challenge here is that the question of whether to coerce to snake case will depend not so much on the target dev's preferences as on the nature of the tap it is getting data from. E.g. definitely not needed from Even if helpful, since naming stability is one of our primary goals here, it might make sense to keep it out for now, until we have very high confidence that the name translation is stable over a large variety of edge cases. |
@radbrt - Is this something you might be interested to contribute? |
@aaronsteers : I can take a closer look at the snakecase function and write down some options. The breaking changes issue makes this kinda tricky. We want to fix it as soon as possible, do it correctly the first time around, but we could also debate this indefinitely. Mostly I believe that lowercasing all-caps columns would not break any future improvements, but I guess it is hard to guarantee. |
Totally with you on this. Lowercasing and removal of special characters both seem sound+stable. I think we thought snake casing would also be stable (thinking of |
Lowercasing when pulling from another database like Oracle doesn't make a lot of sense to me. For non-database sources it seems OK, but my gut says leave the data alone; we are asking for trouble by mapping names. Does renaming just belong in mappers? It is a mapping of source data that we are doing here.
I vote minimal transformations, like removing invalid characters like dash from schema names and using an underscore. Maybe we put the extra transforms behind a flag? |
As always, Stack Overflow has a good suggestion:

    def snakecase(self, name):
        name = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
        name = re.sub('([a-z0-9])([A-Z])', r'\1_\2', name)
        return name.lower()

Some example output:

    print(conform_name('camel2_camel2_case')) # camel2_camel2_case
    print(conform_name('getHTTPResponseCode')) # get_http_response_code
    print(conform_name('HTTPResponseCodeXYZ')) # http_response_code_xyz
    print(conform_name('ABCTradingPartnersLLC')) # abc_trading_partners_llc
    print(conform_name('ABCTradingPartnersLLC_q1_results_24!')) # abc_trading_partners_llc_q1_results_24
    print(conform_name("ONLYCAPITALLETTERS1_XYZ")) # onlycapitalletters1_xyz

One less ideal output:

    print(conform_name('ABCTradingPartnersLLCq1Results_24!')) # abc_trading_partners_ll_cq1_results_24

Some more testing might be in order, but so far I have thrown the new This is most definitely changing column names more than technically necessary, so if it is a contentious issue I'm OK with a switch like @visch suggests. In which case we also need a function to replace just the disallowed features of a column name. Any thoughts @aaronsteers @kgpayne @tayloramurphy ? |
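For anyone who wants to reproduce the outputs quoted above, here is a minimal self-contained sketch. The `snakecase` body is the two-regex version from the comment. The `conform_name` wrapper is an assumption: the thread does not show how the trailing `!` gets removed, so this version simply drops anything outside ASCII letters, digits, and underscores before snake-casing. The SDK's actual method may well differ.

```python
import re


def snakecase(name: str) -> str:
    """Two-pass regex snake-casing, as quoted in the comment above."""
    # Pass 1: split before a capital that starts a capitalised word ("PResponse").
    name = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    # Pass 2: split between a lowercase letter or digit and a following capital.
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    return name.lower()


def conform_name(name: str) -> str:
    """Hypothetical wrapper: drop disallowed characters, then snake-case."""
    # Assumption: only ASCII letters, digits and underscores are allowed.
    cleaned = re.sub(r"[^A-Za-z0-9_]", "", name)
    return snakecase(cleaned)


if __name__ == "__main__":
    for example in (
        "getHTTPResponseCode",
        "ABCTradingPartnersLLC_q1_results_24!",
        "ONLYCAPITALLETTERS1_XYZ",
        "ABCTradingPartnersLLCq1Results_24!",
    ):
        print(f"{example} -> {conform_name(example)}")
```

The "less ideal" case comes from the first pass: in `LLCq1Results` the sequence `LCq` looks like a character followed by a capitalised word, so the split lands between `LL` and `Cq`.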
@radbrt I really like your improved snake case method 👏 I also think it's worth revisiting the motivation for The main driver is that stream property identifiers are much more permissive than database object identifiers (database, schema, table and column names). e.g.

"record": {
"select": "all",
"Select": "also all",
"I can be anything 🤦♂️": "doh",
"MoreRealisticMixedCase": "this comes up a lot",
}

These are all valid record properties, but for many SQL targets they are either invalid or require double quotes (which is often onerous for downstream consumers). So our initial implementation aimed to:
As @aaronsteers says, we hoped snake-case would readily satisfy that second objective, though in hindsight it may have been less conservative than strictly necessary. Snake case is as much a 'quality of life' improvement (esp. for those using With that in mind, I am actually happy with either the "strip illegal chars and reserved names, then Maybe @tayloramurphy and @pnadolny13 might have opinions from a 'data consumers' perspective, or can help us reach consensus on what a "sensible default" might be here 👍 |
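To make the "strip illegal chars and reserved names" option concrete, here is a rough sketch of a conservative conformer that does no snake-casing at all. Everything in it is hypothetical: the function name, the replace-with-underscore policy, the leading-underscore guard, and the tiny `RESERVED` set (a real target would use its dialect's full reserved word list).

```python
import re

# Assumption: a tiny illustrative subset of SQL reserved words.
RESERVED = {"select", "from", "where", "table"}


def minimal_conform(name: str) -> str:
    """Replace disallowed characters and guard reserved words; no snake-casing."""
    # Each run of characters other than ASCII letters, digits or "_" becomes "_".
    conformed = re.sub(r"[^A-Za-z0-9_]+", "_", name)
    # Many SQL dialects reject unquoted identifiers that are empty or start with a digit.
    if not conformed or conformed[0].isdigit():
        conformed = f"_{conformed}"
    # Prefix reserved words so they can be used without quoting.
    if conformed.lower() in RESERVED:
        conformed = f"_{conformed}"
    return conformed


if __name__ == "__main__":
    record_keys = ["select", "Select", "I can be anything", "MoreRealisticMixedCase"]
    for key in record_keys:
        print(f"{key!r} -> {minimal_conform(key)!r}")
```

Note that `select` and `Select` stay distinct here but would still clash on a case-insensitive database, which is one reason lowercasing keeps coming up in this thread.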
For comparison, |
@visch I understand your hesitance to transform names, but I also think pushing name transformations into Mappers generates extra steps and work for project maintainers, unless we find a way to distribute mappers as 'sensible defaults' (i.e. "if your target is a SQL type, we recommend applying this mapper to conform column names"). For non-SDK taps, we would also necessitate the use of This does make me think that, from a metadata perspective, we ought to provide a mechanism for future lineage tooling to discover the upstream, untransformed name. I.e. if the tap catalog contains |
Zero conforming isn't the right answer either. Snake case seems like too much to me, unless we're also subscribing folks to some kind of "Meltano style guide" that we have. Even in that case we could default it to on, but it should be something you can deactivate. Most renaming today happens in dbt, I think, so it's pretty normal for folks (mappers do seem like a stretch now that you say it). We should be keeping conforming to a minimum imo 🤷. Bugs like this issue come in, and there's a plethora of edge cases we haven't started to test or hit yet as well. Less is more here, I think.
Add it to the catalog? Generally dbt seems like the right place for most of this. |
@kgpayne my thoughts:
For @visch's edge case: couldn't we add logic to check for things like this in the post-transformed field names? Duplicates should raise an error; then the user can fall back to mappers or raise an issue for the target to support their edge case. |
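A duplicate check along those lines could be sketched as follows. This is only an illustration: `check_conformed_names` is a made-up helper, not an SDK function, and it takes whatever conforming function the target uses as a plain callable. It also returns the original-to-conformed mapping, which is the kind of information lineage tooling would need.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def check_conformed_names(
    names: Iterable[str], conform: Callable[[str], str]
) -> Dict[str, str]:
    """Return {original: conformed}; raise if two originals conform to the same name."""
    by_conformed: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        by_conformed[conform(name)].append(name)

    # Any conformed name produced by more than one original is a collision.
    collisions = {new: old for new, old in by_conformed.items() if len(old) > 1}
    if collisions:
        raise ValueError(f"Conformed names collide: {collisions}")

    return {originals[0]: new for new, originals in by_conformed.items()}


if __name__ == "__main__":
    # str.lower stands in for a real conforming function.
    print(check_conformed_names(["Id", "CreatedAt"], str.lower))
    try:
        check_conformed_names(["select", "Select"], str.lower)
    except ValueError as err:
        print(f"error: {err}")
```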
There are probably a lot of good ways to think about name standardization and the right level of control, visibility, customization, and stability. As an immediate action though, I think we should probably just disable the snake casing altogether. We can let target developers handle this if they want to override the base implementation. This also reduces scope so we can move forward with the fix asap. Thoughts? |
@aaronsteers I agree. It's causing more problems than it solves, apparently. |
Singer SDK Version
0.14.0
Python Version
3.8
Bug scope
Targets (data type handling, batching, SQL object generation, etc.)
Operating System
Ubuntu
Description
Code and result are in a gist (the schema example is too large for a description here, I guess): https://gist.github.com/visch/e1c2f8c17d6d0e09e40c80964eb7c51b
Not sure exactly what's going on here, as I haven't dug in.
Code
No response
Other places this has come up
https://meltano.slack.com/archives/C01TCRBBJD7/p1670275818704839?thread_ts=1670274133.009739&cid=C01TCRBBJD7 here as well