-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Describe External Assets; rename create_unexecutable_observable_assets_def to external_assets_from_specs #16754
Conversation
Current dependencies on/for this PR:
This comment was auto-generated by Graphite. |
4dc593c
to
71e5d29
Compare
Deploy preview for dagit-storybook ready! ✅ Preview Built with commit 71e5d29. |
Deploy preview for dagit-core-storybook ready! ✅ Preview Built with commit 2629ca6. |
Deploy preview for dagster-docs ready! Preview available at https://dagster-docs-obbuf9osz-elementl.vercel.app Direct link to changed pages: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have long thought that external assets
would be the right name for this feature so I am supportive! This feature resembles BigQuery's external tables feature so I think it will resonate with practitioners.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reservations around "external" have just been the existing meaning in the code base, but those artifacts are not part of the public API so can be renamed.
I think external is much better than unexecutable & observable. I don't think there have been any other real contenders.
I think the external language works well when describing these. In replacing source asset we have some dynamic scoping for what "external" refers to but i think it still makes sense (outside of dagster vs outside of a particular code location).
from dagster._core.definitions.decorators.asset_decorator import asset, multi_asset | ||
from dagster._core.definitions.source_asset import SourceAsset | ||
from dagster._core.errors import DagsterInvariantViolationError | ||
|
||
|
||
def create_unexecutable_observable_assets_def(specs: Sequence[AssetSpec]): | ||
def external_assets_def_from_specs(specs: Sequence[AssetSpec]) -> AssetsDefinition: | ||
"""Create an external assets definition from a sequence of asset specs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's a use case where you would create an external asset from multiple AssetSpecs
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as an aside, doing this
a1 = AssetSpec(key="table_1")
a2 = AssetSpec(key="table_2")
a3 = AssetSpec(key="table_3", deps=[a1])
external_assets = external_assets_def_from_specs([a1, a2, a3])
causes errors when running the UI
dagster._core.errors.DagsterInvariantViolationError: Expected Tuple annotation for multiple outputs, but received non-tuple annotation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is inconsequential to the behavior of this system. This is just a weird outcropping of the fact that our core abstraction is pluralized. I could change it to produce N asset defs if so desired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might not fully understand, but the assets
in the fn name isn't my concern, it's that you can pass a list of AssetSpec
s. I don't really get why you would ever pass a list, so I'm wondering why it can't be
def external_assets_def_from_spec(spec: AssetSpec) -> AssetsDefinition:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is more about convenience than anything. People will construct N asset specs to represent the graph and I want them to be able to coerce them into a form that can be threaded to Definitions
in a single fn call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok that makes sense. I will note that just with the doc bloc and function signature, I was pretty unsure about what would happen if i passed a list of AssetSpecs. And it does still seem to throw an error if you try to pass a list (but that's fixable). I think the initial confusion can be remedied by addressing what happens when you pass multiple AssetSpecs
in the doc block, or by generating an AssetsDefinition
per spec. While the latter is maybe less efficient/convenient, it fits my intuituve mental model of what I would expect this function to do, rather than a single multi_asset which feels like it should consist of assets that are related to each other in some way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm totally cool with creating multiple assets_def. @sryza is that what you would prefer as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that feels more grokkable to me. And I agree that it's ergonomic for it to take a list, rather than requiring the user to write a comprehension.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. Makes sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually this was a very good call. Error message if you somehow get to the user-defined execute function will be much more clear.
Interestingly, we already use the "external" language heavily in the concept docs where we present source assets: https://docs.dagster.io/concepts/assets/software-defined-assets#defining-external-asset-dependencies. This helps me feel comfortable with this change. However, it's a little difficult to understand the impact of this change without understanding how it affects the UI and other parts of how we communicate about Dagster. E.g. are there a set of places in the docs where we'll want to add text like "if the asset is an external asset, then it behaves like X instead?" I feel optimistic about the "external" naming, but the above gives me some apprehension about committing to it before we have the breathing room to understand the user-facing impact a little bit more. A separate, more tactical reservation I have is that it feels weird to build an All that said, exposing a helper function seems pretty low risk. |
|
👍🏻 It is not even essential that this go out with 1.5. It is purely additive on top so can be in a follow on. What is important is aligning on visioning and naming prior to launch week event. |
Nothing new for me to add, the name is a big improvement over unexecutable. I think there are some questions around docs and how we want to communicate this to our users. I think API docs are part of it, but probably will warrant further prose too. |
Roger that. It appears we have consensus on external. We will land this at a reasonable pace, after the pipes renaming has landed. It doesn't not need to land in 1.5 to talk about it at Launch Week. However I want to make it clear that I have very high conviction that this structure replaces source assets and observable source assets, and post launch week will subsequently––as part of our quality/consolidation push––do the following:
This has been a laceration in our product and architecture for too long, and I look forward to eliminating it. |
…ssets_def_from_specs to external_asset_def_from_specs docs snapshot snaps
ca4d19c
to
2629ca6
Compare
Deploy preview for dagster-university ready! ✅ Preview Built with commit 2629ca6. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to land this but not export it top-level.
…s_def to external_assets_from_specs (#16754) ## Summary & Motivation This PR renames `create_unexecutable_observable_assets_def` to `external_assets_from_specs`. I also want this PR to serve as the final discussion on naming this feature and documents this capability. The verbiage in the docblock: ``` Create an external assets definition from a sequence of asset specs. An external asset is an asset that is not materialized by Dagster, but is tracked in the asset graph and asset catalog. A common use case for external assets is modeling data produced by an process not under Dagster's control. For example a daily drop of a file from a third party in s3. In most systems these are described as sources. This includes Dagster, which includes :py:class:`SourceAsset`, which will be supplanted by external assets in the near-term future, as external assets are a superset of the functionality of Source Assets. External assets can act as sources, but that is not their only use. In particular, external assets have themselves have lineage-specified through the ``deps`` argument of :py:class:`AssetSpec`- and can depend on other external assets. External assets are not allowed to depend on non-external assets. The user can emit `AssetMaterialization`, `AssetObservation`, and `AssetCheckEvaluations` events attached external assets. And Dagster now has the ability to have "runless" events to enable many use cases that were previously not possible. Runless events are events generated outside the context of a particular run (for example, in a sensor or by an script), allowing for greater flexibility in event generation. This can be done in a few ways: Note to reviewers that this in an in-progress doc block and the below will have links and examples. 1) DagsterInstance exposes `report_runless_event` that can be used to generate events for external assets directly on an instance. See docs. 2) Sensors can build these events and return them using :py:class:`SensorResult`. A use case for this is using a sensor to continously monitor the metadata exhaust from an external system and inserting events that reflect that exhaust. See docs. 3) Dagster Cloud exposes a REST API for ingesting runless events. Users can copy and paste a curl command in the their external computations (such as Airflow operator) to register metadata associated with those computations See docs. 4) Dagster ops can generate these events directly and yield them or by calling ``log_event`` on :py:class:`OpExecutionContext`. Use cases for this include querying metadata in an external system that is too expensive to do so in a sensor. Or for adapting pure op-based Dagster code to take advantage of asset-oriented lineage, observability, and data quality features, without having to port them wholesale to `@asset`- and `@multi_asset`-based code. This feature set allows users to use Dagster as an observability, lineage, and data quality tool for assets that are not materialized by Dagster. In addition to traditional use cases like sources, this feature can model entire lineage graphs of assets that are scheduled and materialized by other tools and workflow engines. This allows users to use Dagster as a cross-cutting observability tool without migrating their entire data platform to a single orchestration engine. External assets do not have all the features of normal assets: they cannot be materialized ad hoc by Dagster (this is diabled in the UI); cannot be backfilled; cannot be scheduled using auto-materialize policies; and opt out of other features around direct materialization, both now and in the future. External assets also provide fewer guarantees around the correctness of information of their information in the asset catalog. In other words, in exchange for the flexibility Dagster provides less guardrails for external assets than assets that are materialized by Dagster, and there is an increased chance that they will insert non-sensical information into the asset catalog, potentially eroding trust. ``` Suggesting alternative lanuage in this docblock is the best way to talk about an alternative name IMO. ## How I Tested These Changes
## Summary & Motivation Adds an External Assets concept page (motivation described in #16754). This also contains a code change necessary because of the bug demonstrated in #17077. ## How I Tested These Changes BK. Also loaded examples in `dagster dev` --------- Co-authored-by: Erin Cochran <[email protected]> Co-authored-by: Yuhan Luo <[email protected]>
Adds an External Assets concept page (motivation described in #16754). This also contains a code change necessary because of the bug demonstrated in #17077. BK. Also loaded examples in `dagster dev` --------- Co-authored-by: Erin Cochran <[email protected]> Co-authored-by: Yuhan Luo <[email protected]>
Summary & Motivation
This PR renames
create_unexecutable_observable_assets_def
toexternal_assets_from_specs
. I also want this PR to serve as the final discussion on naming this feature and documents this capability. The verbiage in the docblock:Suggesting alternative lanuage in this docblock is the best way to talk about an alternative name IMO.
How I Tested These Changes