Annotate a Dataset Event in the Source Task #37810
Comments
LGTM, looking forward to this 👍
Hi @uranusjr, I have been thinking of the same/similar feature for many weeks, especially in data-driven use cases. We also have a DAG that potentially generates dataset events, but in our use case we need to add context: imagine the file name in the S3 dataset, or a UUID. And since this context is needed and is dynamic, it would be a waste to create 1000's of datasets for 1000's of events on specific files in S3. I like the idea of attaching extra information to the event.
A pro would be that existing
I like the idea. How would this work if the task writes to more than one dataset, though? Another thing I’ve been thinking about is giving XCom a dataset URI so we can track lineage of its values (also tying back to the read/write to XCom via Object Store idea). This raises a question: what should we do if we want to use XCom for the “actual” data, if it is already used for extra? Eventually, what I think we should do is provide some sort of “output management” mechanism that generalises XCom: if XCom is a kind of dataset, its metadata is conceptually just automatically populated dataset metadata. So the return value should still be the actual data we want to write and that downstream tasks depend on (with where and how the data is stored being customisable), and metadata should be provided another way. I’m not entirely sure what the end result should look like, or how to smoothly transition toward it.
I believe it might be an option, as an extension, to also be able to pick which XCom (as an alternative output from the task) is used to fill the extra. It might be another increment, or if there is concrete demand it could also be made right here.
I understand the idea of XCom with a dataset URI. But would this URI refer to a specific DAG run, or to the abstract "last" run? One would be a "moving target" and the other would be "a dataset URI per run", i.e. many, many URIs to track... or do I misunderstand? Can you give an example?
When you ask this question, I understand this would add a new, complex area of XCom management and data flow. At the moment XCom is quite simple to use as a key/value pair to pass data. It does not conform to a schema (e.g. JSON validation / a Pydantic model) and can be any type.
Since XCom is just data storage, it can be used like an external S3 file, or a database the user sets up. It is just a bit more automated and contains some metadata. I feel it is reasonable to assign a dataset URI to each key-value pair, so a dataset event is triggered when a key-value pair is written. This makes cross-DAG XCom usage more useful IMO since it allows a downstream DAG to declare a dependency at the DAG level (via dataset) on the upstream. With that established, if we store extra metadata (of a dataset), it only makes sense to allow extra metadata also when an XCom is written. But if we use XCom for the extra, writing to XCom would write… extra metadata to XCom? And does that extra metadata also have a dataset URI, and can it have extra extra metadata? It becomes awkward.
I would see it as "we use existing XCom metadata" but do not add a new one. The data is just copied on top to the next DAG run's conf. What therefore comes to my mind: we could also add a flag to the Dataset.
So if the marker is set, the return value goes to the dataset event’s extra, instead of (not in addition to) the XCom value.

I think what makes me feel uncomfortable about using XCom is that the model doesn’t attach any special semantics to the data stored in it. It is likely that at least some people use it as generic storage for data, instead of metadata (of the data). This means we can’t have a guaranteed way to tell whether a value in there is supposed to be metadata (associated with some other data), or random data. But if the metadata does not go into the XCom table (but somewhere else instead), I think that’s fine.

Another way to do this would be to introduce a special type to return from a task function, like:

```python
from airflow.datasets import Dataset, Metadata
from airflow.decorators import task
from airflow.io.path import ObjectStoragePath


@task(outlets=[Dataset("s3://my/data.json")])
def my_task():
    with ObjectStoragePath("s3://my/data.json").open("w") as f:
        ...  # Write to file...
    return Metadata(uri="s3://my/data.json", extra={"extra": "metadata"})
```

This is maybe more visible than setting a flag:
@task(outlets=[Dataset("s3://my/data.json", event_extra_source="xcom")])
def my_task():
with ObjectStoragePath("s3://my/data.json").open("w") as f:
... # Write to file...
# Need to double check above to understand what this return implies.
return {"extra": "metadata"} |
I gave this a pretty long thought. I am leaning toward implementing the following:

```python
@task(outlets=[Dataset("s3://my/data.json")])
def my_task():
    with ObjectStoragePath("s3://my/data.json").open("w") as f:
        ...  # Write to file...
    yield Metadata(uri="s3://my/data.json", extra={"extra": "metadata"})
    return data  # This goes to XCom!
```

The thing I particularly like about this is that in the future, when XCom gets its own lineage information and can also take additional metadata, we can also introduce another special type to allow passing in data and metadata at the same time:

```python
@task(outlets=[Dataset("s3://my/data.json")])
def my_task():
    with ObjectStoragePath("s3://my/data.json").open("w") as f:
        ...  # Write to file...
    return Output(data, extra={"extra": "metadata"})
```

This also opens the door for sending multiple things from one single function if we allow yielding more than one of these values.

That said, I think implementing the context-based approach is still a good first step toward all this. Even with the more magical and convenient return-as-metadata syntax, using a context variable is still explicit and may be preferred by some. It is also easier to implement, and should be a good way to start things rolling without getting into a ton of syntax design, while focusing on the core feature here. So I’m going to start with that first.
Like to have this option. Also thought of this. Had the idea of a "future extension" as well, with the primary intent to keep it simple first :-D
Looking forward to a PR. What do you think, could both your and my proposals be possible, mostly based on what the user needs? In many cases the extra information might not be used at all, or the existing mechanisms are totally fine. Something like:
or
...whereas in your notation with the
Description
To eventually support the construct and UI we’re aiming for in assets, we need to attach metadata to the actual data, not the task that produces it, nor the location it is written to.
In the task-based web UI, we can show the attached metadata on the task that emits the dataset event, to give the impression that the metadata is directly associated with the task. In the implementation, however, the metadata would only be associated with the dataset, and only indirectly related to the task by the fact that the task emits the event.
Use case/motivation
An Airflow task generating data may want to attach information about it. Airflow does not currently provide a good interface for this. The only thing that resembles such a feature is to attach an extra dict on `Dataset`, like this:
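(A minimal sketch of what that static extra looks like; the URI and keys here are illustrative.)

```python
from airflow.datasets import Dataset

# Extra attached at definition time; the values are fixed and cannot vary per run.
example_dataset = Dataset(
    "s3://my/data.json",
    extra={"owner": "data-platform-team"},
)
```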
This is, however, quite limiting. It may be good enough for static information such as who owns this data, but not for information that is only known at runtime, to provide additional context to the generated data.
Store runtime-populated extras on `DatasetEvent`
When a Dataset event is emitted, the corresponding `DatasetEvent` model in the database already has a field called `extra`. This is, however, currently not populated when the event is generated from a task outlet (only when it’s created via the REST API). A previous design discussion contains the following comment from @blag:
and
However, I would argue that user code in an Airflow DAG should also have the ability to store custom information. While the information is readable in downstream tasks—thus technically is a mechanism to pass data between tasks—the main intention behind the design is instead to annotate the generated data, and does not go against the original design.
Provide extras at runtime
The task function (either `@task`-decorated, or a classic operator’s `execute`) will be able to attach values to a Dataset URI in the function. This is done by an accessor proxy under the key `dataset_events`, so in the task function you can:
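A sketch of how the proposed accessor could look inside a decorated task; the exact assignment syntax (`.extra = {...}`) and the parameter injection of `dataset_events` are assumptions based on the proposal above, not a confirmed API:

```python
from airflow.datasets import Dataset
from airflow.decorators import task

example = Dataset("s3://my/data.json")


@task(outlets=[example])
def write_data(*, dataset_events):
    # ... produce and write the data to s3://my/data.json ...
    # Attach runtime-only information to the event emitted for this outlet.
    dataset_events["s3://my/data.json"].extra = {"row_count": 42}
```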
After the task function’s execution, the extras provided dynamically are written to the `DatasetEvent` entry generated for the Dataset. Note especially that this is entirely distinct from `extra` on `Dataset`.
Instead of using the URI, you can use the dataset object directly to access the proxy:
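Continuing the sketch above, with the dataset object itself as the key instead of the raw URI (again assumed syntax):

```python
@task(outlets=[example])
def write_data(*, dataset_events):
    # ... produce and write the data ...
    dataset_events[example].extra = {"row_count": 42}
```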
Example using the context dict instead:
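Reusing the same names, a sketch via the context dict, assuming the proposed `dataset_events` key is exposed in the task context:

```python
@task(outlets=[example])
def write_data(**context):
    # ... produce and write the data ...
    context["dataset_events"][example].extra = {"row_count": 42}
```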
With a classic operator:
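A sketch with a classic operator, where the accessor would come from the `context` passed to `execute` (the class name and the key usage are illustrative assumptions):

```python
from airflow.datasets import Dataset
from airflow.models.baseoperator import BaseOperator


class WriteDataOperator(BaseOperator):
    """Hypothetical operator that writes data and annotates the dataset event."""

    def __init__(self, *, uri: str, **kwargs):
        super().__init__(outlets=[Dataset(uri)], **kwargs)
        self.uri = uri

    def execute(self, context):
        # ... produce and write the data to self.uri ...
        # Assumed: the proposed accessor is available in the execution context.
        context["dataset_events"][self.uri].extra = {"row_count": 42}
```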
Show dataset event extras in web UI
Both dataset and dataset event extras currently have zero visibility to users in the web UI. This is somewhat acceptable for datasets, where the extra dict is static, but is a problem for dynamically generated values. Additional components should be added to the web UI to display extras emitted by a dataset event.
An obvious first addition would be to add a table in the task instance panel in the Grid view when the task instance emits dataset events with extras. Quick UI mock:
Each key and value will simply be stringified to be displayed in the table. This should be enough for simple data, since the extra dict currently needs to be JSON-compatible. We can discuss richer data display (similar to how Jupyter displays a DataFrame), and putting this information in other places (e.g. in the Dataset view), in the future.
Related issues
#35297
#36075
Are you willing to submit a PR?