Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move SourceDefinition to dbt/artifacts #9543

Merged
merged 19 commits into from
Feb 15, 2024

Conversation

QMalcolm
Copy link
Contributor

@QMalcolm QMalcolm commented Feb 8, 2024

resolves #9384

Problem

We are moving data artifacts to dbt/artifacts in a piecewise fashion. We needed to move SourceDefinition as part of that.

Solution

Moved data portion of SourceDefinition to dbt/artifacts (and all other classes that doing so required)

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX
  • This PR includes type annotations for new and modified functions

We need to move the data parts of the `Time` definition to dbt/artifacts.
That is not what we're doing in this commit. In this commit we're simply
moving the functional `Time` definition upstream of `unparsed` and `nodes`.
This does two things
  - Mirrors the import path that the resource `time` definition will have in dbt/artifacts
  - Reduces the chance of ciricular import problems between `unparsed` and `nodes`
We need to move the data parts of the `FreshnessThreshold` definition to dbt/artifacts.
That is not what we're doing in this commit. In this commit we're simply
moving the functional `FreshnessThreshold` definition upstream of `unparsed` and `nodes`.
This does two things
  - Mirrors the import path that the resource `FreshnessThreshold` definition will have in dbt/artifacts
  - Reduces the chance of ciricular import problems between `unparsed` and `nodes`
Note: We had to override some of the attrs of the `FreshnessThreshold`
resource because the resource version only has access to the resource
version of `Time`. The overrides in the functional definition of
`FreshnessThreshold` make it so the attrs use the functional version
of `Time`.
This is a precursor to splitting `HasRelationMetadata` into it's
data and functional parts.
Something interesting here is that we had to override the `freshness`
property. We had to do this because if we didn't we wouldn't get the
functional parts of `FreshnessThreshold`, we'd only get the data parts.

Also of note, the `SourceDefintion` has a lot of `@property` methods that
on other classes would be actual attribute properties of the node. There is
an argument to be made that these should be moved as well, but thats perhaps
a separate discussion.

Finally, we have not (yet) moved `NodeInfoMixin`. It is an open discussion
whether we do or not. It seems primarily functional, as a means to update the
source freshness information. As the artifacts primarily deal with the shape
of the data, not how it should be set, it seems for now that `NodeInfoMixin`
should stay in core / not move to artifacts. This thinking may change though.
@QMalcolm QMalcolm requested a review from a team as a code owner February 8, 2024 20:08
@QMalcolm QMalcolm requested a review from gshank February 8, 2024 20:08
@cla-bot cla-bot bot added the cla:yes label Feb 8, 2024
Copy link

codecov bot commented Feb 8, 2024

Codecov Report

Attention: 6 lines in your changes are missing coverage. Please review.

Comparison is base (581d856) 85.59% compared to head (e7fe26c) 87.97%.
Report is 9 commits behind head on main.

Files Patch % Lines
...re/dbt/artifacts/resources/v1/source_definition.py 94.44% 3 Missing ⚠️
core/dbt/contracts/graph/nodes.py 94.73% 2 Missing ⚠️
core/dbt/artifacts/resources/v1/components.py 98.43% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9543      +/-   ##
==========================================
+ Coverage   85.59%   87.97%   +2.38%     
==========================================
  Files         167      168       +1     
  Lines       22164    22208      +44     
==========================================
+ Hits        18971    19538     +567     
+ Misses       3193     2670     -523     
Flag Coverage Δ
integration 85.62% <96.57%> (+0.03%) ⬆️
unit 61.93% <92.57%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

quoting: Quoting = field(default_factory=Quoting)
loaded_at_field: Optional[str] = None
class SourceDefinition(
NodeInfoMixin,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is some question on whether NodeInfoMixin should be moved, with arguments both ways. We can do so, perhaps it's worth discussing further.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline: NodeInfoMixin makes a lot of sense to keep as an internal concern of dbt-core, since it is actively set in the internal node + removed prior to serialization.

We could probably even remove the https://github.com/dbt-labs/dbt-core/pull/9543/files#diff-42cfc99b2fcd89ab75957b8999bee2b1af822f7d39953ca99f1f285c170557c1R1221 method!

The complexity is in ensuring that the SourceDefinition node is not serialized directly via Manifest.writable_manifest, but is first converted to a resource. We'll need to put a better pattern in place to avoid using nodes and resources interchangeably for the purposes of serialization

class SourceDefinition(NodeInfoMixin, ParsedSourceMandatory):
quoting: Quoting = field(default_factory=Quoting)
loaded_at_field: Optional[str] = None
class SourceDefinition(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class has a bunch of @property methods that are full on property attributes in other classes, things like depends_on, refs, and groups. I think they are property methods here to reduce the memory footprint? Because they're methods, I did not move them. However, there's an argument to be made that we should move them to the resource definition in dbt/artifacts because they are expected to have on instances of the class 🤔

core/dbt/contracts/graph/components.py Outdated Show resolved Hide resolved
core/dbt/contracts/graph/components.py Outdated Show resolved Hide resolved
core/dbt/contracts/graph/components.py Outdated Show resolved Hide resolved
In the next commit we're gonna add a `to_resource` method. As we don't
want to have to pass a resource into `to_resource`, the class itself
needs to expose what resource class should be built. Thus a type annotation
is no longer enough. To solve this we've added a class method to BaseNode
which returns the associated resource class. The method on BaseNode will
raise a NotImplementedError unless the the inheriting class has overridden
the `resouce_class` method to return the a resource class.

You may be thinking "Why not a class property"? And that is absolutely a
valid question. We used to be able to chain `@classmethod` with
`@property` to create a class property. However, this was deprecated in
python 3.11 and removed in 3.13 (details on why this happened can be found
[here](python/cpython#89519)). There is an
[alternate way to setup a class property](python/cpython#89519 (comment)),
however this seems a bit convoluted if a class method easily gets the job
done. The draw back is that we must do `.resource_class()` instead of
`.resource_class` and on classes implementing `BaseNode` we have to
override it with a method instead of a property specification.

Additionally, making it a class _instance_ property won't work because
we don't want to require an _instance_ of the class to get the
`resource_class` as we might not have an instance at our dispossal.
Nodes have extra attributes. We don't want these extra attributes to
get serialized. Thus we're converting back to resources prior to
serialization. There could be a CPU hit here as we're now dictifying
and undictifying right before serialization. We can do some complicated
and non-straight-forward things to get around this. However, we want
to see how big of a perforance hit we actually have before going that
route.
The method `__post_serialize__` on the `SourceDefinition` was used for
ensuring the property `_event_status` didn't make it to the serialized
version of the node. Now that resource definition of `SourceDefinition`
handles serialization/deserialization, we can drop `__post_serialize__`
as it is no longer needed.
@QMalcolm QMalcolm force-pushed the qmalcolm--9384-move-source-definition-to-dbt-artifacts branch from a77c4ed to c6fe862 Compare February 14, 2024 15:50
@MichelleArk MichelleArk added the artifact_minor_upgrade To bypass the CI check by confirming that the change is not breaking label Feb 14, 2024
@@ -1020,26 +1023,30 @@ def build_group_map(self):
group_map[node.group].append(node.unique_id)
self.group_map = group_map

@classmethod
def _map_nodes_to_map_resources(cls, nodes_map: MutableMapping[str, NodeClassT]):
return {name: node.to_resource() for name, node in nodes_map.items()}
Copy link
Contributor

@MichelleArk MichelleArk Feb 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit: node_id would be more precise than name

We discussed this on the PR. It seems like a minimal lift, and minimal to
support. Doing so also has the benefit of reducing a bunch of the overriding
we were previously doing.
exposures=self.exposures,
metrics=self.metrics,
exposures=self._map_nodes_to_map_resources(self.exposures),
metrics=self._map_nodes_to_map_resources(self.metrics),
groups=self.groups,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we should also implement the resource_class method on Group in nodes.py and use self._map_nodes_to_map_resources on it here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

# complicated ways of making a class property, however a class method suits our
# purposes well enough
@classmethod
def resource_class(cls) -> Type[BaseResource]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@MichelleArk MichelleArk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small outstanding comment, otherwise looks amazing! Thank you ✨

@QMalcolm QMalcolm merged commit 8a1b927 into main Feb 15, 2024
52 checks passed
@QMalcolm QMalcolm deleted the qmalcolm--9384-move-source-definition-to-dbt-artifacts branch February 15, 2024 17:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
artifact_minor_upgrade To bypass the CI check by confirming that the change is not breaking cla:yes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CT-3554] Define SourceDefinition contract in dbt/artifacts
2 participants