Add SemanticModel Node Type #7769
---

New changelog entry, `@@ -0,0 +1,6 @@`:

```yaml
kind: Features
body: Added support for parsing and serializing semantic models
time: 2023-06-06T16:53:51.117429-04:00
custom:
  Author: peterallenwebb
  Issue: 7499 7503
```
---

`@@ -25,21 +25,22 @@` — the import block from `dbt.contracts.graph.nodes` is alphabetized, with `RelationalNode` and `SemanticModel` newly imported:

```python
from dbt.contracts.publication import ProjectDependencies, PublicationConfig, PublicModel

from dbt.contracts.graph.nodes import (
    BaseNode,
    Documentation,
    Exposure,
    GenericTestNode,
    GraphMemberNode,
    Group,
    Macro,
    ManifestNode,
    ManifestOrPublicNode,
    Metric,
    ModelNode,
    RelationalNode,
    ResultNode,
    SemanticModel,
    SourceDefinition,
    UnpatchedSourceDefinition,
)
from dbt.contracts.graph.unparsed import SourcePatch, NodeVersion, UnparsedVersion
from dbt.contracts.graph.manifest_upgrade import upgrade_manifest_json
```
```diff
@@ -706,6 +707,7 @@ class Manifest(MacroMethods, DataClassMessagePackMixin, dbtClassMixin):
     public_nodes: MutableMapping[str, PublicModel] = field(default_factory=dict)
     project_dependencies: Optional[ProjectDependencies] = None
     publications: MutableMapping[str, PublicationConfig] = field(default_factory=dict)
+    semantic_models: MutableMapping[str, SemanticModel] = field(default_factory=dict)

     _doc_lookup: Optional[DocLookup] = field(
         default=None, metadata={"serialize": lambda x: None, "deserialize": lambda x: None}
```

Review thread on `semantic_models`:

> **Reviewer:** The existing pattern is that the node objects use "Model" and the dictionaries use "nodes" — `Model` nodes are found in the `nodes` dictionary, and `PublicModel` nodes in `public_nodes`. Could this be `semantic_nodes` to preserve that pattern?
>
> **Author:** In that case, do you think we should just rely on the existing `nodes` dictionary and not have a special case for `SemanticModel` objects? Would it matter that the `SemanticModel`s are not yet linked into the compiled graph?
>
> **Reviewer:** The original metrics went in a `metrics` dictionary... This has been a bit ad hoc and not closely thought through, but the existing pattern seems to be that SQL things go in `nodes` (plus seeds...), and yaml-generated things go in their own dictionaries. There are assumptions in the rest of the code that match, such as the links in the file objects. So I think it still makes sense to put semantic models in their own dictionary. If anything, I'd be tempted to separate some of the existing things in the `nodes` dictionary into their own dictionaries and make some combined "indexes" for cases where we don't want to loop over `nodes` all the time.
>
> **Reviewer:** Just wanted to mention that one of the reasons for going in the direction of more individual dictionaries rather than fewer is that jsonschema and the deserializers currently can't always correctly guess the classes of the objects in the `nodes` dictionary, leading to other hacky things like the big deserialization if-statement in the [...]
>
> **Author:** This is all really good to know. It seems like the right direction to me.
```diff
@@ -894,7 +896,7 @@ def build_group_map(self):
             group_map[node.group].append(node.unique_id)
         self.group_map = group_map

-    def writable_manifest(self):
+    def writable_manifest(self) -> "WritableManifest":
         self.build_parent_and_child_maps()
         self.build_group_map()
         return WritableManifest(
```

```diff
@@ -912,6 +914,7 @@ def writable_manifest(self):
             child_map=self.child_map,
             parent_map=self.parent_map,
             group_map=self.group_map,
+            semantic_models=self.semantic_models,
         )

     def write(self, path):
```
```diff
@@ -1246,6 +1249,11 @@ def add_doc(self, source_file: SourceFile, doc: Documentation):
         self.docs[doc.unique_id] = doc
         source_file.docs.append(doc.unique_id)

+    def add_semantic_model(self, source_file: SchemaSourceFile, semantic_model: SemanticModel):
+        _check_duplicates(semantic_model, self.semantic_models)
+        self.semantic_models[semantic_model.unique_id] = semantic_model
+        source_file.semantic_models.append(semantic_model.unique_id)
+
     # end of methods formerly in ParseResult

     # Provide support for copy.deepcopy() - we just need to avoid the lock!
```
```diff
@@ -1345,6 +1353,9 @@ class WritableManifest(ArtifactMixin):
     public_nodes: Mapping[UniqueID, PublicModel] = field(
         metadata=dict(description=("The public models used in the dbt project"))
     )
+    semantic_models: Mapping[UniqueID, SemanticModel] = field(
+        metadata=dict(description=("The semantic models defined in the dbt project"))
+    )
     metadata: ManifestMetadata = field(
         metadata=dict(
             description="Metadata about the manifest",
```
---

```diff
@@ -6,29 +6,23 @@
 import hashlib

 from mashumaro.types import SerializableType
-from typing import (
-    Optional,
-    Union,
-    List,
-    Dict,
-    Any,
-    Sequence,
-    Tuple,
-    Iterator,
-)
+from typing import Optional, Union, List, Dict, Any, Sequence, Tuple, Iterator, Protocol

 from dbt.dataclass_schema import dbtClassMixin, ExtensibleDbtClassMixin

 from dbt.clients.system import write_file
 from dbt.contracts.files import FileHash
 from dbt.contracts.graph.unparsed import (
+    Dimension,
     Docs,
+    Entity,
     ExposureType,
     ExternalTable,
     FreshnessThreshold,
     HasYamlMetadata,
     MacroArgument,
     MaturityType,
+    Measure,
     MetricFilter,
     MetricTime,
     Owner,
```
```diff
@@ -62,12 +56,6 @@
     EmptySnapshotConfig,
     SnapshotConfig,
 )
-import sys
-
-if sys.version_info >= (3, 8):
-    from typing import Protocol
-else:
-    from typing_extensions import Protocol


 # =====================================================================
```
```diff
@@ -564,6 +552,30 @@ def depends_on_macros(self):
         return self.depends_on.macros


+@dataclass
+class FileSlice(dbtClassMixin, Replaceable):
+    """Provides file slice level context about what something was created from.
+
+    Implementation of the dbt-semantic-interfaces `FileSlice` protocol
+    """
+
+    filename: str
+    content: str
+    start_line_number: int
+    end_line_number: int
+
+
+@dataclass
+class Metadata(dbtClassMixin, Replaceable):
+    """Provides file context about what something was created from.
+
+    Implementation of the dbt-semantic-interfaces `Metadata` protocol
+    """
+
+    repo_file_path: str
+    file_slice: FileSlice
+
+
 # ====================================
 # CompiledNode subclasses
 # ====================================
```
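A rough illustration of how the two classes above fit together — plain dataclasses stand in for the real classes, which also mix in `dbtClassMixin`/`Replaceable`, and the file names and yaml content here are made up:

```python
from dataclasses import dataclass

# Plain-dataclass stand-ins for the FileSlice/Metadata classes above.
@dataclass
class FileSlice:
    filename: str
    content: str
    start_line_number: int
    end_line_number: int

@dataclass
class Metadata:
    repo_file_path: str
    file_slice: FileSlice

# Hypothetical example: record where a semantic model's yaml came from.
yaml_text = "semantic_models:\n  - name: orders\n"
meta = Metadata(
    repo_file_path="models/semantic.yml",
    file_slice=FileSlice(
        filename="semantic.yml",
        content=yaml_text,
        start_line_number=1,
        end_line_number=2,
    ),
)
```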
```diff
@@ -1411,6 +1423,28 @@ class Group(BaseNode):
     resource_type: NodeType = field(metadata={"restrict": [NodeType.Group]})


+# ====================================
+# SemanticModel and related classes
+# ====================================
+
+
+@dataclass
+class NodeRelation(dbtClassMixin):
+    alias: str
+    schema_name: str  # TODO: Could this be called simply "schema" so we could reuse StateRelation?
+    database: Optional[str] = None
+
+
+@dataclass
+class SemanticModel(GraphNode):
+    description: Optional[str]
+    model: str
+    node_relation: Optional[NodeRelation]
+    entities: Sequence[Entity]
+    measures: Sequence[Measure]
+    dimensions: Sequence[Dimension]
+
+
 # ====================================
 # Patches
 # ====================================
```

Review thread on `NodeRelation`:

> **Reviewer:** Stu added in a [...]
>
> **Author:** That one doesn't actually have the `relation_name` in it, and I think they want that. Settling on a common relation class is the "larger issue" that I mentioned we didn't need to solve at this particular moment.

Review thread on the `database` field:

> **Reviewer:** From this comment — does this still need to include a unified `relation_name` that respects quoting + include policies?
>
> **Reviewer:** If we do, we'll probably need `adapter.Relation` to do something like this during compilation (from https://github.com/dbt-labs/dbt-core/blob/main/core/dbt/compilation.py#L486-L488):
>
> ```python
> adapter = get_adapter(self.config)
> relation_cls = adapter.Relation
> relation_name = str(relation_cls.create_from(self.config, node))
> ```
>
> **Reviewer:** I don't think the `SemanticModel`s actually get compiled, since they're yaml-only. There is a question of whether they need the individual pieces (identifier/schema/database) or just the `relation_name`...
>
> **Author:** @QMalcolm I'm not sure of the answer to this, but I suspect the answer may be yes. What makes sense from the perspective of MetricFlow integration?
>
> **Reviewer:** @MichelleArk Yep! We need to include the unified `relation_name`, and it's desirable that it respects quoting + include policies. As for whether we need both the individual pieces as well as the [...]
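The "unified `relation_name` that respects quoting + include policies" idea from the thread can be sketched roughly as follows. `QuotePolicy` and `render_relation_name` are hypothetical names introduced here for illustration; dbt's real `adapter.Relation.create_from` handles far more cases (adapter-specific quote characters, include policies per component, etc.).

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical quoting policy: one flag per relation component.
@dataclass
class QuotePolicy:
    database: bool = True
    schema: bool = True
    identifier: bool = True

def render_relation_name(
    alias: str,
    schema_name: str,
    database: Optional[str],
    policy: QuotePolicy,
    quote_char: str = '"',
) -> str:
    # Quote a component only if the policy asks for it.
    def q(part: str, quoted: bool) -> str:
        return f"{quote_char}{part}{quote_char}" if quoted else part

    parts: List[str] = []
    if database is not None:  # "include policy": database may be omitted entirely
        parts.append(q(database, policy.database))
    parts.append(q(schema_name, policy.schema))
    parts.append(q(alias, policy.identifier))
    return ".".join(parts)
```

Under these assumptions, a fully quoted relation renders as `"db"."schema"."alias"`, while dropping the database and disabling quoting yields `schema.alias`.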
---

```diff
@@ -57,6 +57,7 @@
     DeprecatedReference,
     UpcomingReferenceDeprecation,
 )
+from dbt_extractor import py_extract_from_source  # type: ignore
 from dbt.logger import DbtProcessState
 from dbt.node_types import NodeType, AccessType
 from dbt.clients.jinja import get_rendered, MacroStack
```

```diff
@@ -99,6 +100,7 @@
     ManifestNode,
     ResultNode,
     ModelNode,
+    NodeRelation,
 )
 from dbt.contracts.graph.unparsed import NodeVersion
 from dbt.contracts.util import Writable
```
```diff
@@ -528,6 +530,7 @@ def load(self):
             self.process_refs(self.root_project.project_name)
             self.process_docs(self.root_project)
             self.process_metrics(self.root_project)
+            self.process_semantic_models()
             self.check_valid_group_config()

             # update tracking data
```
```diff
@@ -1176,6 +1179,28 @@ def process_metrics(self, config: RuntimeConfig):
                 continue
             _process_metrics_for_node(self.manifest, current_project, exposure)

+    def process_semantic_models(self) -> None:
+        for semantic_model in self.manifest.semantic_models.values():
+            if semantic_model.model:
+                statically_parsed = py_extract_from_source(f"{{{{ {semantic_model.model} }}}}")
+                if statically_parsed["refs"]:
+                    ref = statically_parsed["refs"][0]
+                    if len(ref) == 2:
+                        input_package_name, input_model_name = ref
+                    else:
+                        input_package_name, input_model_name = None, ref[0]
+
+                    refd_node = self.manifest.ref_lookup.find(
+                        input_model_name, input_package_name, None, self.manifest
+                    )
+                    if isinstance(refd_node, ModelNode):
+                        semantic_model.node_relation = NodeRelation(
+                            alias=refd_node.alias,
+                            schema_name=refd_node.schema,
+                            database=refd_node.database,
+                        )

 # nodes: node and column descriptions
 # sources: source and table descriptions, column descriptions
 # macros: macro argument descriptions
```

Review thread on the `py_extract_from_source` call:

> **Reviewer:** This call can throw an [...]
>
> **Author:** Agree, and we'll follow up in the current sprint with #7822.

Review thread on the `ref_lookup.find` call:

> **Reviewer:** Could we add a `# TODO` comment here that links to #7688? Refs with the `version` keyword won't be supported until 7688 is closed. At that point we can pass in the parsed version instead of always passing [...]
>
> **Author:** Agree, and we'll follow up in the current sprint with #7822.
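The ref normalization inside `process_semantic_models` can be isolated as a small helper. This sketch only mirrors the two-branch logic in the hunk above, assuming (as the code does) that the static parser returns each ref as a one-element list (`["model_name"]`) or a two-element list (`["package", "model_name"]`):

```python
from typing import List, Optional, Tuple

def split_ref(ref: List[str]) -> Tuple[Optional[str], str]:
    """Normalize a statically parsed ref into (package_name, model_name).

    A two-element ref carries an explicit package; a one-element ref
    has no package, so the package component is None.
    """
    if len(ref) == 2:
        package_name, model_name = ref
    else:
        package_name, model_name = None, ref[0]
    return package_name, model_name
```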
---

```diff
@@ -1,8 +1,13 @@
 from dbt.parser.schemas import YamlReader, SchemaParser
 from dbt.parser.common import YamlBlock
 from dbt.node_types import NodeType
-from dbt.contracts.graph.unparsed import UnparsedExposure, UnparsedMetric, UnparsedGroup
-from dbt.contracts.graph.nodes import Exposure, Metric, Group
+from dbt.contracts.graph.unparsed import (
+    UnparsedExposure,
+    UnparsedGroup,
+    UnparsedMetric,
+    UnparsedSemanticModel,
+)
+from dbt.contracts.graph.nodes import Exposure, Group, Metric, SemanticModel
 from dbt.exceptions import DbtInternalError, YamlParseDictError, JSONValidationError
 from dbt.context.providers import generate_parse_exposure, generate_parse_metrics
 from dbt.contracts.graph.model_config import MetricConfig, ExposureConfig
```
```diff
@@ -269,3 +274,46 @@ def parse(self):
                 raise YamlParseDictError(self.yaml.path, self.key, data, exc)

             self.parse_group(unparsed)
+
+
+class SemanticModelParser(YamlReader):
+    def __init__(self, schema_parser: SchemaParser, yaml: YamlBlock):
+        super().__init__(schema_parser, yaml, "semantic_models")
+        self.schema_parser = schema_parser
+        self.yaml = yaml
+
+    def parse_semantic_model(self, unparsed: UnparsedSemanticModel):
+        package_name = self.project.project_name
+        unique_id = f"{NodeType.SemanticModel}.{package_name}.{unparsed.name}"
+        path = self.yaml.path.relative_path
+
+        fqn = self.schema_parser.get_fqn_prefix(path)
+        fqn.append(unparsed.name)
+
+        parsed = SemanticModel(
+            description=unparsed.description,
+            fqn=fqn,
+            model=unparsed.model,
+            name=unparsed.name,
+            node_relation=None,  # Resolved from the value of "model" after parsing
+            original_file_path=self.yaml.path.original_file_path,
+            package_name=package_name,
+            path=path,
+            resource_type=NodeType.SemanticModel,
+            unique_id=unique_id,
+            entities=unparsed.entities,
+            measures=unparsed.measures,
+            dimensions=unparsed.dimensions,
+        )
+
+        self.manifest.add_semantic_model(self.yaml.file, parsed)
+
+    def parse(self):
+        for data in self.get_key_dicts():
+            try:
+                UnparsedSemanticModel.validate(data)
+                unparsed = UnparsedSemanticModel.from_dict(data)
+            except (ValidationError, JSONValidationError) as exc:
+                raise YamlParseDictError(self.yaml.path, self.key, data, exc)
+
+            self.parse_semantic_model(unparsed)
```

Review thread on `node_relation=None`:

> **Reviewer:** For my own understanding — we need to resolve the relation later on because not all model nodes are available in the manifest for lookup at this point?
>
> **Reviewer:** Should we make the node relation evaluate lazily, so we can put its value in here?
>
> **Reviewer:** `SemanticModel`s come from yaml only, and the referenced models should always be SQL-based and so already resolvable when this is parsed.
>
> **Author:** @MichelleArk Yes, I discussed this with Gerda and she thought we could possibly handle it immediately, since models would likely have been created, but doing it at the end would be most likely to succeed without complications. @aranke I can follow up on the idea of lazy evaluation in the current sprint. If you have a specific method of doing it in mind, can you add it as a comment to #7822?
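The `unique_id` in `parse_semantic_model` comes from string-formatting the `NodeType` enum member. A minimal stand-in shows the resulting shape — this is not dbt's real `NodeType`, and the `"semantic_model"` value is an assumption made for illustration:

```python
from enum import Enum

# Minimal stand-in for dbt's NodeType (the real one is a str-valued enum
# with many more members).
class NodeType(str, Enum):
    SemanticModel = "semantic_model"

    def __str__(self) -> str:
        return self.value

def make_unique_id(package_name: str, name: str) -> str:
    # Mirrors the f-string in parse_semantic_model above.
    return f"{NodeType.SemanticModel}.{package_name}.{name}"
```

So a semantic model named `orders` in project `my_project` would get a unique id of the form `semantic_model.my_project.orders` under this assumed enum value.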
Review thread on the `semantic_models` key:

> **gshank:** `semantic_nodes`? (see other comment)
>
> **Author:** @gshank To clarify, do you think we should change the name in the yml file, or just our internal property names?
>
> **gshank:** Just the internal property names. I think "semantic_models" in the yaml files is fine — it matches the "models" section we already have. Naming is so hard.