-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move SourceDefinition
to dbt/artifacts
#9543
Move SourceDefinition
to dbt/artifacts
#9543
Conversation
We need to move the data parts of the `Time` definition to dbt/artifacts. That is not what we're doing in this commit. In this commit we're simply moving the functional `Time` definition upstream of `unparsed` and `nodes`. This does two things - Mirrors the import path that the resource `time` definition will have in dbt/artifacts - Reduces the chance of ciricular import problems between `unparsed` and `nodes`
We need to move the data parts of the `FreshnessThreshold` definition to dbt/artifacts. That is not what we're doing in this commit. In this commit we're simply moving the functional `FreshnessThreshold` definition upstream of `unparsed` and `nodes`. This does two things - Mirrors the import path that the resource `FreshnessThreshold` definition will have in dbt/artifacts - Reduces the chance of ciricular import problems between `unparsed` and `nodes`
Note: We had to override some of the attrs of the `FreshnessThreshold` resource because the resource version only has access to the resource version of `Time`. The overrides in the functional definition of `FreshnessThreshold` make it so the attrs use the functional version of `Time`.
…odule in dbt/artifacts
This is a precursor to splitting `HasRelationMetadata` into it's data and functional parts.
Something interesting here is that we had to override the `freshness` property. We had to do this because if we didn't we wouldn't get the functional parts of `FreshnessThreshold`, we'd only get the data parts. Also of note, the `SourceDefintion` has a lot of `@property` methods that on other classes would be actual attribute properties of the node. There is an argument to be made that these should be moved as well, but thats perhaps a separate discussion. Finally, we have not (yet) moved `NodeInfoMixin`. It is an open discussion whether we do or not. It seems primarily functional, as a means to update the source freshness information. As the artifacts primarily deal with the shape of the data, not how it should be set, it seems for now that `NodeInfoMixin` should stay in core / not move to artifacts. This thinking may change though.
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #9543 +/- ##
==========================================
+ Coverage 85.59% 87.97% +2.38%
==========================================
Files 167 168 +1
Lines 22164 22208 +44
==========================================
+ Hits 18971 19538 +567
+ Misses 3193 2670 -523
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
quoting: Quoting = field(default_factory=Quoting) | ||
loaded_at_field: Optional[str] = None | ||
class SourceDefinition( | ||
NodeInfoMixin, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is some question on whether NodeInfoMixin
should be moved, with arguments both ways. We can do so, perhaps it's worth discussing further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline: NodeInfoMixin makes a lot of sense to keep as an internal concern of dbt-core, since it is actively set in the internal node + removed prior to serialization.
We could probably even remove the https://github.com/dbt-labs/dbt-core/pull/9543/files#diff-42cfc99b2fcd89ab75957b8999bee2b1af822f7d39953ca99f1f285c170557c1R1221 method!
The complexity is in ensuring that the SourceDefinition node is not serialized directly via Manifest.writable_manifest, but is first converted to a resource. We'll need to put a better pattern in place to avoid using nodes and resources interchangeably for the purposes of serialization
class SourceDefinition(NodeInfoMixin, ParsedSourceMandatory): | ||
quoting: Quoting = field(default_factory=Quoting) | ||
loaded_at_field: Optional[str] = None | ||
class SourceDefinition( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class has a bunch of @property
methods that are full on property attributes in other classes, things like depends_on
, refs
, and groups
. I think they are property methods here to reduce the memory footprint? Because they're methods, I did not move them. However, there's an argument to be made that we should move them to the resource definition in dbt/artifacts because they are expected to have on instances of the class 🤔
In the next commit we're gonna add a `to_resource` method. As we don't want to have to pass a resource into `to_resource`, the class itself needs to expose what resource class should be built. Thus a type annotation is no longer enough. To solve this we've added a class method to BaseNode which returns the associated resource class. The method on BaseNode will raise a NotImplementedError unless the the inheriting class has overridden the `resouce_class` method to return the a resource class. You may be thinking "Why not a class property"? And that is absolutely a valid question. We used to be able to chain `@classmethod` with `@property` to create a class property. However, this was deprecated in python 3.11 and removed in 3.13 (details on why this happened can be found [here](python/cpython#89519)). There is an [alternate way to setup a class property](python/cpython#89519 (comment)), however this seems a bit convoluted if a class method easily gets the job done. The draw back is that we must do `.resource_class()` instead of `.resource_class` and on classes implementing `BaseNode` we have to override it with a method instead of a property specification. Additionally, making it a class _instance_ property won't work because we don't want to require an _instance_ of the class to get the `resource_class` as we might not have an instance at our dispossal.
Nodes have extra attributes. We don't want these extra attributes to get serialized. Thus we're converting back to resources prior to serialization. There could be a CPU hit here as we're now dictifying and undictifying right before serialization. We can do some complicated and non-straight-forward things to get around this. However, we want to see how big of a perforance hit we actually have before going that route.
The method `__post_serialize__` on the `SourceDefinition` was used for ensuring the property `_event_status` didn't make it to the serialized version of the node. Now that resource definition of `SourceDefinition` handles serialization/deserialization, we can drop `__post_serialize__` as it is no longer needed.
a77c4ed
to
c6fe862
Compare
core/dbt/contracts/graph/manifest.py
Outdated
@@ -1020,26 +1023,30 @@ def build_group_map(self): | |||
group_map[node.group].append(node.unique_id) | |||
self.group_map = group_map | |||
|
|||
@classmethod | |||
def _map_nodes_to_map_resources(cls, nodes_map: MutableMapping[str, NodeClassT]): | |||
return {name: node.to_resource() for name, node in nodes_map.items()} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super nit: node_id
would be more precise than name
We discussed this on the PR. It seems like a minimal lift, and minimal to support. Doing so also has the benefit of reducing a bunch of the overriding we were previously doing.
…ources` Naming is hard. That is all.
core/dbt/contracts/graph/manifest.py
Outdated
exposures=self.exposures, | ||
metrics=self.metrics, | ||
exposures=self._map_nodes_to_map_resources(self.exposures), | ||
metrics=self._map_nodes_to_map_resources(self.metrics), | ||
groups=self.groups, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we should also implement the resource_class method on Group in nodes.py and use self._map_nodes_to_map_resources on it here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
# complicated ways of making a class property, however a class method suits our | ||
# purposes well enough | ||
@classmethod | ||
def resource_class(cls) -> Type[BaseResource]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✨
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small outstanding comment, otherwise looks amazing! Thank you ✨
resolves #9384
Problem
We are moving data artifacts to
dbt/artifacts
in a piecewise fashion. We needed to moveSourceDefinition
as part of that.Solution
Moved data portion of
SourceDefinition
todbt/artifacts
(and all other classes that doing so required)Checklist