[CT-828] [Feature] dbt-core raising compile errors for invalid metric names (and potentially invalid model names) #5456

callum-mcdata · 2022-07-11T14:50:03Z

Is this your first time opening an issue?

I have read the expectations for open source contributors

Describe the Feature

Continuing this conversation from CT-65 [Bug] Metrics' names shouldn’t be allowed to contain spaces after some more recent work on the metrics package. @joellabes had raised some really great points around additional constraints for metric names that weren't fixed when that issue was closed. Namely, leading numbers and periods in metric names that would cause column aliasing to fail.

If I understand the issue correctly, there are two paths forward that I can see:

1. Introduce additional naming constraints into dbt parsing logic based around db column naming

Pros:
- Users will immediately know at compile time whether or not their metric names are valid
- We can confidently use metric names as an alias in the metric package
Cons:
- Not a small amount of work. To my knowledge we don't have these kind of db column/table naming constraints documented anywhere so we'd have to research those for the big 4, develop some kind of framework that can be used by adapter maintainers, and then implement the new conditions into the parsing logic.

2. Accept run errors around column names

Pros:
- We don't need to add additional functionality to dbt-core
- We can get away with clear documentation around the metric.
Cons:
- The error message users get will not help inform them around what the real error is. They'll get a column name not valid error but unless we have clear documentation around this error, it won't point them towards the fix.

Based on the amount of work laid out in the Con section of option 1, I am personally of the opinion that we move forward with option 2 until there is bandwidth or interest in the core team to tackle something like this. Granted, I could be totally off base around how much work it is and that could change the math here.

Describe alternatives you've considered

^See option 2.

Who will this benefit?

This will benefit anyone using the metrics package and potentially more if the scope is expanded to model names.

Are you interested in contributing this feature?

Potentially! Depends on the complexity.

Anything else?

Nope! For those who read the entire issue, thank you - please enjoy this very cute video of a bunny!

joellabes · 2022-07-12T02:20:40Z

Third option: 1b. All the pros of Option 1, swaps out the con from option 1 for a small subset of option 2's cons.

Basically: Disallow things we know are bad (columns starting with digits, periods in column names) and play a bit of whack-a-mole on anything else that comes up as/if it happens.

jtcohen6 · 2022-07-14T10:32:00Z

I was going to propose something similar to Joel's idea above, from the opposite starting point: What if we were just very strict about metric naming? Alphanumeric characters and underscores, can't start with a number. That's it. Not too tricky to regex for that

I'm pretty sure that's the rule for column naming in BigQuery + Apache Spark, and it ought to be the rule for Snowflake, since anything trickier than that requires quotes (bad!!).

For the ultimate implementation, we just need to update this logic with a slightly fancier regex:

dbt-core/core/dbt/contracts/graph/unparsed.py

Lines 468 to 469 in f988f76

    
           if "name" in data and " " in data["name"]: 
        
               raise ParsingException(f"Metrics name '{data['name']}' cannot contain spaces")

Pulling this into the v1.3 milestone, since it feels like a thing we should try to get right before metrics' "experimentality" comes to an end!

callum-mcdata · 2022-07-21T02:12:13Z

That works for me - seems like the simplest solution that satisfies the requirements. Figured it would be good to confirm with an example and the documentation.

The Big 4

Snowflake: Documentation
Start with a letter (A-Z, a-z) or an underscore (“”)._
Contain only letters, underscores, decimal digits (0-9), and dollar signs (“$”).
BigQuery: Documentation
A column name must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_),
It must start with a letter or underscore.
A column name cannot use any of the following prefixes:
- _ TABLE _
- _ FILE _
- _PARTITION
- _ROW_TIMESTAMP
- __ ROOT __
- _COLIDENTIFIER
Redshift: Documentation
Begin with an ASCII single-byte alphabetic character or underscore character
Subsequent characters can be ASCII single-byte alphanumeric characters, underscores, or dollar signs
Be between 1 and 127 bytes in length, not including quotation marks for delimited identifiers.
Not be a reserved SQL key word.
Databricsk: Documentation
We can just impose our best practices and that should work
letter: Any letter from A-Z or a-z.
digit: Any numeral from 0 to 9.
c: Any character from the character set. Use to escape special characters (for example,.` ).

Proposed Regex:

Metric name characters must be alphanumeric or underscore
Metric name must begin with an letter (this allows us to get around the BQ naming prefixes which all begin with _ )
Metric names must be 126 characters or less (Redshift constraint. BQ and Snowflake are longer)
Metric name must not be a single word in the following list

Restricted Single Words	Reason/Source
ACCOUNT	Snowflake
CASE	Snowflake
CAST	Snowflake
CONNECTION	Snowflake
CONSTRAINT	Snowflake
CROSS	Snowflake
CURRENT_DATE	Snowflake
CURRENT_TIME	Snowflake
CURRENT_TIMESTAMP	Snowflake
CURRENT_USER	Snowflake
DATABASE	Snowflake
FALSE	Snowflake
FULL	Snowflake
GSCLUSTER	Snowflake
ILIKE	Snowflake
INCREMENT	Snowflake
INNER	Snowflake
ISSUE	Snowflake
JOIN	Snowflake
LATERAL	Snowflake
LEFT	Snowflake
LOCALTIME	Snowflake
LOCALTIMESTAMP	Snowflake
MINUS	Snowflake
NATURAL	Snowflake
ORGANIZATION	Snowflake
QUALIFY	Snowflake
REGEXP	Snowflake
RIGHT	Snowflake
RLIKE	Snowflake
SCHEMA	Snowflake
SOME	Snowflake
TRUE	Snowflake
TRY_CAST	Snowflake
USING	Snowflake
VIEW	Snowflake
WHEN	Snowflake
ALL	Reserved by ANSI.
ALTER	Reserved by ANSI.
AND	Reserved by ANSI.
ANY	Reserved by ANSI.
AS	Reserved by ANSI.
BETWEEN	Reserved by ANSI.
BY	Reserved by ANSI.
CHECK	Reserved by ANSI.
COLUMN	Reserved by ANSI.
CONNECT	Reserved by ANSI.
CREATE	Reserved by ANSI.
CURRENT	Reserved by ANSI.
DELETE	Reserved by ANSI.
DISTINCT	Reserved by ANSI.
DROP	Reserved by ANSI.
ELSE	Reserved by ANSI.
EXISTS	Reserved by ANSI.
FOLLOWING	Reserved by ANSI.
FOR	Reserved by ANSI.
FROM	Reserved by ANSI.
GRANT	Reserved by ANSI.
GROUP	Reserved by ANSI.
HAVING	Reserved by ANSI.
IN	Reserved by ANSI.
INSERT	Reserved by ANSI.
INTERSECT	Reserved by ANSI.
INTO	Reserved by ANSI.
IS	Reserved by ANSI.
LIKE	Reserved by ANSI.
NOT	Reserved by ANSI.
NULL	Reserved by ANSI.
OF	Reserved by ANSI.
ON	Reserved by ANSI.
OR	Reserved by ANSI.
ORDER	Reserved by ANSI.
REVOKE	Reserved by ANSI.
ROW	Reserved by ANSI.
ROWS	Reserved by ANSI.
SAMPLE	Reserved by ANSI.
SELECT	Reserved by ANSI.
SET	Reserved by ANSI.
START	Reserved by ANSI.
TABLESAMPLE	Reserved by ANSI.
THEN	Reserved by ANSI.
TO	Reserved by ANSI.
TRIGGER	Reserved by ANSI.
UNION	Reserved by ANSI.
UNIQUE	Reserved by ANSI.
UPDATE	Reserved by ANSI.
VALUES	Reserved by ANSI.
WHENEVER	Reserved by ANSI.
WHERE	Reserved by ANSI.
WITH	Reserved by ANSI.
AES128	Redshift
AES256	Redshift
ALLOWOVERWRITE	Redshift
ANALYSE	Redshift
ANALYZE	Redshift
AUTHORIZATION	Redshift
AZ64	Redshift
BACKUP	Redshift
BINARY	Redshift
BLANKSASNULL	Redshift
BOTH	Redshift
BYTEDICT	Redshift
BZIP2	Redshift
CREDENTIALS	Redshift
CURRENT_USER_ID	Redshift
DEFERRABLE	Redshift
DEFLATE	Redshift
DEFRAG	Redshift
DELTA	Redshift
DELTA32K	Redshift
DISABLE	Redshift
DO	Redshift
EMPTYASNULL	Redshift
ENABLE	Redshift
ENCODE	Redshift
ENCRYPT	Redshift
ENCRYPTION	Redshift
EXPLICIT	Redshift
FOREIGN	Redshift
FREEZE	Redshift
GLOBALDICT256	Redshift
GLOBALDICT64K	Redshift
GZIP	Redshift
INITIALLY	Redshift
ISNULL	Redshift
LANGUAGE	Redshift
LEADING	Redshift
LUN	Redshift
LUNS	Redshift
LZO	Redshift
LZOP	Redshift
MOSTLY16	Redshift
MOSTLY32	Redshift
MOSTLY8	Redshift
NOTNULL	Redshift
OFF	Redshift
OFFLINE	Redshift
OFFSET	Redshift
OID	Redshift
OLD	Redshift
ONLY	Redshift
OPEN	Redshift
OVERLAPS	Redshift
PARALLEL	Redshift
PERCENT	Redshift
PERMISSIONS	Redshift
PIVOT	Redshift
PLACING	Redshift
PRIMARY	Redshift
RAW	Redshift
READRATIO	Redshift
RECOVER	Redshift
REFERENCES	Redshift
REJECTLOG	Redshift
RESORT	Redshift
RESTORE	Redshift
SESSION_USER	Redshift
SIMILAR	Redshift
SNAPSHOT	Redshift
SYSDATE	Redshift
SYSTEM	Redshift
TABLE	Redshift
TAG	Redshift
TDES	Redshift
TEXT255	Redshift
TEXT32K	Redshift
TIMESTAMP	Redshift
TOP	Redshift
TRAILING	Redshift
TRUNCATECOLUMNS	Redshift
UNPIVOT	Redshift
USER	Redshift
VERBOSE	Redshift
WALLET	Redshift
WITHOUT	Redshift
ASC	BigQuery
ARRAY	BigQuery
ASSERT_ROWS_MODIFIED	BigQuery
AT	BigQuery
COLLATE	BigQuery
CONTAINS	BigQuery
CUBE	BigQuery
DEFAULT	BigQuery
DEFINE	BigQuery
DESC	BigQuery
END	BigQuery
ENUM	BigQuery
ESCAPE	BigQuery
EXCEPT	BigQuery
EXCLUDE	BigQuery
EXTRACT	BigQuery
FETCH	BigQuery
GROUPING	BigQuery
GROUPS	BigQuery
HASH	BigQuery
IF	BigQuery
IGNORE	BigQuery
INTERVAL	BigQuery
LIMIT	BigQuery
LOOKUP	BigQuery
MERGE	BigQuery
NEW	BigQuery
NO	BigQuery
NULLS	BigQuery
OUTER	BigQuery
OVER	BigQuery
PARTITION	BigQuery
PRECEDING	BigQuery
PROTO	BigQuery
RANGE	BigQuery
RECURSIVE	BigQuery
RESPECT	BigQuery
ROLLUP	BigQuery
STRUCT	BigQuery
TREAT	BigQuery
UNBOUNDED	BigQuery
UNNEST	BigQuery
WINDOW	BigQuery
WITHIN	BigQuery

jtcohen6 · 2022-07-21T13:07:11Z

Thanks for the refinement @callum-mcdata!

These two make total sense to me:

Metric name characters must be alphanumeric or underscore

Metric name must begin with an letter (this allows us to get around the BQ naming prefixes which all begin with _ )

Metric names must be 126 characters or less (Redshift constraint. BQ and Snowflake are longer)

I could see some folks eventually running into issues with the third one, though I don't mind it personally. Concision is a virtue (not one I always live up to) — and we support description for a reason!

This is the only one that concerns me:

Metric name must not be a single word in the following list

To perform that validation, we'd need to either:

Store this corpus for ALL supported databases in dbt-core — ANSI standard keywords feel reasonable to have in dbt-core, the others less so
Store the reserved words in the adapter, and use an adapter method to retrieve reserved words — awkward and technically complex, based on when this parsing / validation logic is happening

We'd also need to update the list of keywords each time the database adds a new one. That feels more reasonable to do if we go with option 2, and trust each adapter maintainer to keep on top of it. Maybe, in the future, this could be a feature of our SQL grammar + dialects...?

For now - how willing would you be to punt on item 4? We would document: "The name of your metric must not match a reserved SQL keyword in your database."

callum-mcdata · 2022-07-21T13:37:39Z

@jtcohen6 I'd be fine with punting on option 4. I got a bit carried away last night trying to think of all possible complications that would raise issues for naming metrics 😅

Maybe, in the future, this could be a feature of our SQL grammar + dialects...?

I think this is a great point and, at least to me, helps justify punting on this. Once that work stream is in flight, we can revisit this discussion and see if we want to go with option 2 (adapters) or the SQL grammar.

As for option 3 (character limits) I definitely agree that people will run into this constraint. description and label are our friends - internally we've created metric names that break this "rule" and they're near impossible to understand at a glance. We've taken to using label as the "Name" in 3rd party integrations because the metric name is too long.

So with that in mind, the proposed changes are:

Metric name characters must be alphanumeric or underscore
Metric name must begin with an letter (this allows us to get around the BQ naming prefixes which all begin with _ )
Metric names must be 126 characters or less (Redshift constraint. BQ and Snowflake are longer)

jtcohen6 · 2022-07-21T13:55:34Z

description and label are our friends - internally we've created metric names that break this "rule" and they're near impossible to understand at a glance. We've taken to using label as the "Name" in 3rd party integrations because the metric name is too long.

💯

I feel even better about the character constraint. Let's make it happen!

leahwicz · 2022-08-30T13:42:43Z

This is the acceptance criteria to close this ticket out:

* Metric name characters must be alphanumeric or underscore
* Metric name must begin with an letter (this allows us to get around the BQ naming prefixes which all begin with _ )
* Metric names must be 126 characters or less (Redshift constraint. BQ and Snowflake are longer)

emmyoop · 2022-09-08T16:24:20Z

This should include validating names for exposures as well #5606

callum-mcdata added enhancement New feature or request triage labels Jul 11, 2022

github-actions bot changed the title ~~[Feature] dbt-core raising compile errors for invalid metric names (and potentially invalid model names)~~ [CT-828] [Feature] dbt-core raising compile errors for invalid metric names (and potentially invalid model names) Jul 11, 2022

callum-mcdata added the metrics label Jul 11, 2022

jtcohen6 added Team:Language and removed triage labels Jul 14, 2022

jtcohen6 added this to the v1.3 milestone Jul 14, 2022

jtcohen6 mentioned this issue Aug 3, 2022

[CT-1184] [Feature] Exposures: stricter validation for name, support custom label #5606

Closed

3 tasks

leahwicz assigned emmyoop Sep 13, 2022

emmyoop mentioned this issue Sep 14, 2022

Add name validation for metrics #5841

Merged

6 tasks

emmyoop closed this as completed in #5841 Sep 21, 2022

callum-mcdata mentioned this issue Nov 28, 2022

[Bug] Metric name equal to reserved keyword throws error. dbt-labs/dbt_metrics#124

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CT-828] [Feature] dbt-core raising compile errors for invalid metric names (and potentially invalid model names) #5456

[CT-828] [Feature] dbt-core raising compile errors for invalid metric names (and potentially invalid model names) #5456

callum-mcdata commented Jul 11, 2022 •

edited by jira bot

Loading

joellabes commented Jul 12, 2022

jtcohen6 commented Jul 14, 2022

callum-mcdata commented Jul 21, 2022

jtcohen6 commented Jul 21, 2022

callum-mcdata commented Jul 21, 2022

jtcohen6 commented Jul 21, 2022

leahwicz commented Aug 30, 2022

emmyoop commented Sep 8, 2022

[CT-828] [Feature] dbt-core raising compile errors for invalid metric names (and potentially invalid model names) #5456

[CT-828] [Feature] dbt-core raising compile errors for invalid metric names (and potentially invalid model names) #5456

Comments

callum-mcdata commented Jul 11, 2022 • edited by jira bot Loading

Is this your first time opening an issue?

Describe the Feature

Describe alternatives you've considered

Who will this benefit?

Are you interested in contributing this feature?

Anything else?

joellabes commented Jul 12, 2022

jtcohen6 commented Jul 14, 2022

callum-mcdata commented Jul 21, 2022

The Big 4

Proposed Regex:

jtcohen6 commented Jul 21, 2022

callum-mcdata commented Jul 21, 2022

jtcohen6 commented Jul 21, 2022

leahwicz commented Aug 30, 2022

emmyoop commented Sep 8, 2022

callum-mcdata commented Jul 11, 2022 •

edited by jira bot

Loading