Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-828] [Feature] dbt-core raising compile errors for invalid metric names (and potentially invalid model names) #5456

Closed
1 task done
callum-mcdata opened this issue Jul 11, 2022 · 8 comments · Fixed by #5841
Assignees
Labels
enhancement New feature or request
Milestone

Comments

@callum-mcdata
Copy link
Contributor

callum-mcdata commented Jul 11, 2022

Is this your first time opening an issue?

Describe the Feature

Continuing this conversation from CT-65 [Bug] Metrics' names shouldn’t be allowed to contain spaces after some more recent work on the metrics package. @joellabes had raised some really great points around additional constraints for metric names that weren't fixed when that issue was closed. Namely, leading numbers and periods in metric names that would cause column aliasing to fail.

If I understand the issue correctly, there are two paths forward that I can see:

1. Introduce additional naming constraints into dbt parsing logic based around db column naming

  • Pros:
    • Users will immediately know at compile time whether or not their metric names are valid
    • We can confidently use metric names as an alias in the metric package
  • Cons:
    • Not a small amount of work. To my knowledge we don't have these kind of db column/table naming constraints documented anywhere so we'd have to research those for the big 4, develop some kind of framework that can be used by adapter maintainers, and then implement the new conditions into the parsing logic.

2. Accept run errors around column names

  • Pros:
    • We don't need to add additional functionality to dbt-core
    • We can get away with clear documentation around the metric.
  • Cons:
    • The error message users get will not help inform them around what the real error is. They'll get a column name not valid error but unless we have clear documentation around this error, it won't point them towards the fix.

Based on the amount of work laid out in the Con section of option 1, I am personally of the opinion that we move forward with option 2 until there is bandwidth or interest in the core team to tackle something like this. Granted, I could be totally off base around how much work it is and that could change the math here.

Describe alternatives you've considered

^See option 2.

Who will this benefit?

This will benefit anyone using the metrics package and potentially more if the scope is expanded to model names.

Are you interested in contributing this feature?

Potentially! Depends on the complexity.

Anything else?

Nope! For those who read the entire issue, thank you - please enjoy this very cute video of a bunny!

@callum-mcdata callum-mcdata added enhancement New feature or request triage labels Jul 11, 2022
@github-actions github-actions bot changed the title [Feature] dbt-core raising compile errors for invalid metric names (and potentially invalid model names) [CT-828] [Feature] dbt-core raising compile errors for invalid metric names (and potentially invalid model names) Jul 11, 2022
@joellabes
Copy link
Contributor

Third option: 1b. All the pros of Option 1, swaps out the con from option 1 for a small subset of option 2's cons.

Basically: Disallow things we know are bad (columns starting with digits, periods in column names) and play a bit of whack-a-mole on anything else that comes up as/if it happens.

@jtcohen6 jtcohen6 added this to the v1.3 milestone Jul 14, 2022
@jtcohen6
Copy link
Contributor

I was going to propose something similar to Joel's idea above, from the opposite starting point: What if we were just very strict about metric naming? Alphanumeric characters and underscores, can't start with a number. That's it. Not too tricky to regex for that

I'm pretty sure that's the rule for column naming in BigQuery + Apache Spark, and it ought to be the rule for Snowflake, since anything trickier than that requires quotes (bad!!).

For the ultimate implementation, we just need to update this logic with a slightly fancier regex:

if "name" in data and " " in data["name"]:
raise ParsingException(f"Metrics name '{data['name']}' cannot contain spaces")

Pulling this into the v1.3 milestone, since it feels like a thing we should try to get right before metrics' "experimentality" comes to an end!

@callum-mcdata
Copy link
Contributor Author

That works for me - seems like the simplest solution that satisfies the requirements. Figured it would be good to confirm with an example and the documentation.

The Big 4

  • Snowflake: Documentation

  • Start with a letter (A-Z, a-z) or an underscore (“”)._

  • Contain only letters, underscores, decimal digits (0-9), and dollar signs (“$”).

  • BigQuery: Documentation

  • A column name must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_),

  • It must start with a letter or underscore.

  • A column name cannot use any of the following prefixes:

    • _ TABLE _
    • _ FILE _
    • _PARTITION
    • _ROW_TIMESTAMP
    • __ ROOT __
    • _COLIDENTIFIER
  • Redshift: Documentation

  • Begin with an ASCII single-byte alphabetic character or underscore character

  • Subsequent characters can be ASCII single-byte alphanumeric characters, underscores, or dollar signs

  • Be between 1 and 127 bytes in length, not including quotation marks for delimited identifiers.

  • Not be a reserved SQL key word.

  • Databricsk: Documentation

  • We can just impose our best practices and that should work

  • letter: Any letter from A-Z or a-z.

  • digit: Any numeral from 0 to 9.

  • c: Any character from the character set. Use to escape special characters (for example,.` ).

Proposed Regex:

  • Metric name characters must be alphanumeric or underscore
  • Metric name must begin with an letter (this allows us to get around the BQ naming prefixes which all begin with _ )
  • Metric names must be 126 characters or less (Redshift constraint. BQ and Snowflake are longer)
  • Metric name must not be a single word in the following list
Restricted Single Words Reason/Source
ACCOUNT Snowflake
CASE Snowflake
CAST Snowflake
CONNECTION Snowflake
CONSTRAINT Snowflake
CROSS Snowflake
CURRENT_DATE Snowflake
CURRENT_TIME Snowflake
CURRENT_TIMESTAMP Snowflake
CURRENT_USER Snowflake
DATABASE Snowflake
FALSE Snowflake
FULL Snowflake
GSCLUSTER Snowflake
ILIKE Snowflake
INCREMENT Snowflake
INNER Snowflake
ISSUE Snowflake
JOIN Snowflake
LATERAL Snowflake
LEFT Snowflake
LOCALTIME Snowflake
LOCALTIMESTAMP Snowflake
MINUS Snowflake
NATURAL Snowflake
ORGANIZATION Snowflake
QUALIFY Snowflake
REGEXP Snowflake
RIGHT Snowflake
RLIKE Snowflake
SCHEMA Snowflake
SOME Snowflake
TRUE Snowflake
TRY_CAST Snowflake
USING Snowflake
VIEW Snowflake
WHEN Snowflake
ALL Reserved by ANSI.
ALTER Reserved by ANSI.
AND Reserved by ANSI.
ANY Reserved by ANSI.
AS Reserved by ANSI.
BETWEEN Reserved by ANSI.
BY Reserved by ANSI.
CHECK Reserved by ANSI.
COLUMN Reserved by ANSI.
CONNECT Reserved by ANSI.
CREATE Reserved by ANSI.
CURRENT Reserved by ANSI.
DELETE Reserved by ANSI.
DISTINCT Reserved by ANSI.
DROP Reserved by ANSI.
ELSE Reserved by ANSI.
EXISTS Reserved by ANSI.
FOLLOWING Reserved by ANSI.
FOR Reserved by ANSI.
FROM Reserved by ANSI.
GRANT Reserved by ANSI.
GROUP Reserved by ANSI.
HAVING Reserved by ANSI.
IN Reserved by ANSI.
INSERT Reserved by ANSI.
INTERSECT Reserved by ANSI.
INTO Reserved by ANSI.
IS Reserved by ANSI.
LIKE Reserved by ANSI.
NOT Reserved by ANSI.
NULL Reserved by ANSI.
OF Reserved by ANSI.
ON Reserved by ANSI.
OR Reserved by ANSI.
ORDER Reserved by ANSI.
REVOKE Reserved by ANSI.
ROW Reserved by ANSI.
ROWS Reserved by ANSI.
SAMPLE Reserved by ANSI.
SELECT Reserved by ANSI.
SET Reserved by ANSI.
START Reserved by ANSI.
TABLESAMPLE Reserved by ANSI.
THEN Reserved by ANSI.
TO Reserved by ANSI.
TRIGGER Reserved by ANSI.
UNION Reserved by ANSI.
UNIQUE Reserved by ANSI.
UPDATE Reserved by ANSI.
VALUES Reserved by ANSI.
WHENEVER Reserved by ANSI.
WHERE Reserved by ANSI.
WITH Reserved by ANSI.
AES128 Redshift
AES256 Redshift
ALLOWOVERWRITE Redshift
ANALYSE Redshift
ANALYZE Redshift
AUTHORIZATION Redshift
AZ64 Redshift
BACKUP Redshift
BINARY Redshift
BLANKSASNULL Redshift
BOTH Redshift
BYTEDICT Redshift
BZIP2 Redshift
CREDENTIALS Redshift
CURRENT_USER_ID Redshift
DEFERRABLE Redshift
DEFLATE Redshift
DEFRAG Redshift
DELTA Redshift
DELTA32K Redshift
DISABLE Redshift
DO Redshift
EMPTYASNULL Redshift
ENABLE Redshift
ENCODE Redshift
ENCRYPT Redshift
ENCRYPTION Redshift
EXPLICIT Redshift
FOREIGN Redshift
FREEZE Redshift
GLOBALDICT256 Redshift
GLOBALDICT64K Redshift
GZIP Redshift
INITIALLY Redshift
ISNULL Redshift
LANGUAGE Redshift
LEADING Redshift
LUN Redshift
LUNS Redshift
LZO Redshift
LZOP Redshift
MOSTLY16 Redshift
MOSTLY32 Redshift
MOSTLY8 Redshift
NOTNULL Redshift
OFF Redshift
OFFLINE Redshift
OFFSET Redshift
OID Redshift
OLD Redshift
ONLY Redshift
OPEN Redshift
OVERLAPS Redshift
PARALLEL Redshift
PERCENT Redshift
PERMISSIONS Redshift
PIVOT Redshift
PLACING Redshift
PRIMARY Redshift
RAW Redshift
READRATIO Redshift
RECOVER Redshift
REFERENCES Redshift
REJECTLOG Redshift
RESORT Redshift
RESTORE Redshift
SESSION_USER Redshift
SIMILAR Redshift
SNAPSHOT Redshift
SYSDATE Redshift
SYSTEM Redshift
TABLE Redshift
TAG Redshift
TDES Redshift
TEXT255 Redshift
TEXT32K Redshift
TIMESTAMP Redshift
TOP Redshift
TRAILING Redshift
TRUNCATECOLUMNS Redshift
UNPIVOT Redshift
USER Redshift
VERBOSE Redshift
WALLET Redshift
WITHOUT Redshift
ASC BigQuery
ARRAY BigQuery
ASSERT_ROWS_MODIFIED BigQuery
AT BigQuery
COLLATE BigQuery
CONTAINS BigQuery
CUBE BigQuery
DEFAULT BigQuery
DEFINE BigQuery
DESC BigQuery
END BigQuery
ENUM BigQuery
ESCAPE BigQuery
EXCEPT BigQuery
EXCLUDE BigQuery
EXTRACT BigQuery
FETCH BigQuery
GROUPING BigQuery
GROUPS BigQuery
HASH BigQuery
IF BigQuery
IGNORE BigQuery
INTERVAL BigQuery
LIMIT BigQuery
LOOKUP BigQuery
MERGE BigQuery
NEW BigQuery
NO BigQuery
NULLS BigQuery
OUTER BigQuery
OVER BigQuery
PARTITION BigQuery
PRECEDING BigQuery
PROTO BigQuery
RANGE BigQuery
RECURSIVE BigQuery
RESPECT BigQuery
ROLLUP BigQuery
STRUCT BigQuery
TREAT BigQuery
UNBOUNDED BigQuery
UNNEST BigQuery
WINDOW BigQuery
WITHIN BigQuery

@jtcohen6
Copy link
Contributor

Thanks for the refinement @callum-mcdata!

These two make total sense to me:

  • Metric name characters must be alphanumeric or underscore
  • Metric name must begin with an letter (this allows us to get around the BQ naming prefixes which all begin with _ )
  • Metric names must be 126 characters or less (Redshift constraint. BQ and Snowflake are longer)

I could see some folks eventually running into issues with the third one, though I don't mind it personally. Concision is a virtue (not one I always live up to) — and we support description for a reason!

This is the only one that concerns me:

Metric name must not be a single word in the following list

To perform that validation, we'd need to either:

  1. Store this corpus for ALL supported databases in dbt-core — ANSI standard keywords feel reasonable to have in dbt-core, the others less so
  2. Store the reserved words in the adapter, and use an adapter method to retrieve reserved words — awkward and technically complex, based on when this parsing / validation logic is happening

We'd also need to update the list of keywords each time the database adds a new one. That feels more reasonable to do if we go with option 2, and trust each adapter maintainer to keep on top of it. Maybe, in the future, this could be a feature of our SQL grammar + dialects...?

For now - how willing would you be to punt on item 4? We would document: "The name of your metric must not match a reserved SQL keyword in your database."

@callum-mcdata
Copy link
Contributor Author

@jtcohen6 I'd be fine with punting on option 4. I got a bit carried away last night trying to think of all possible complications that would raise issues for naming metrics 😅

Maybe, in the future, this could be a feature of our SQL grammar + dialects...?

I think this is a great point and, at least to me, helps justify punting on this. Once that work stream is in flight, we can revisit this discussion and see if we want to go with option 2 (adapters) or the SQL grammar.

As for option 3 (character limits) I definitely agree that people will run into this constraint. description and label are our friends - internally we've created metric names that break this "rule" and they're near impossible to understand at a glance. We've taken to using label as the "Name" in 3rd party integrations because the metric name is too long.

So with that in mind, the proposed changes are:

  1. Metric name characters must be alphanumeric or underscore
  2. Metric name must begin with an letter (this allows us to get around the BQ naming prefixes which all begin with _ )
  3. Metric names must be 126 characters or less (Redshift constraint. BQ and Snowflake are longer)

@jtcohen6
Copy link
Contributor

description and label are our friends - internally we've created metric names that break this "rule" and they're near impossible to understand at a glance. We've taken to using label as the "Name" in 3rd party integrations because the metric name is too long.

💯

I feel even better about the character constraint. Let's make it happen!

@leahwicz
Copy link
Contributor

This is the acceptance criteria to close this ticket out:

* Metric name characters must be alphanumeric or underscore
* Metric name must begin with an letter (this allows us to get around the BQ naming prefixes which all begin with _ )
* Metric names must be 126 characters or less (Redshift constraint. BQ and Snowflake are longer)

@emmyoop
Copy link
Member

emmyoop commented Sep 8, 2022

This should include validating names for exposures as well #5606

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants