Add dataset update endpoint #29433
Conversation
I'm not sure about this, since the broadcasted event has no source dag/task etc.
cc: @dstandish
There are a few reasons why I think it's super important to at least support (not necessarily encourage) external dataset changes.
airflow/datasets/manager.py
else:
    # When an external dataset change is made through the API, it isn't triggered by a task instance,
    # so we create a DatasetEvent without the task and dag data.
    dataset_event = DatasetEvent(
It would be great to have extra information available when the dataset has been externally changed, such as:
- by whom: `external_auth_id` or `external_service_id` -> required
- from where (api, client_ip / remote_addr): `external_source` -> required
- the timestamp of the actual event, so it can be reconciled if required -> nullable, as it might not be available

This ensures lineage isn't broken across systems.
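A rough sketch of what those columns might look like on the DatasetEvent model; the column names below mirror the suggestions in this comment and are placeholders, not an agreed schema:

```python
# Hypothetical extra columns on DatasetEvent for externally-sourced changes.
# Names (external_auth_id, external_source, external_event_timestamp) are
# placeholders taken from the suggestion above, not the final schema.
from sqlalchemy import Column, Integer, String

from airflow.models.base import Base
from airflow.utils.sqlalchemy import UtcDateTime


class DatasetEventSketch(Base):
    __tablename__ = "dataset_event_sketch"

    id = Column(Integer, primary_key=True, autoincrement=True)
    # ... existing DatasetEvent columns (dataset_id, extra, source_task_id, ...) ...

    # Who made the external change (auth id or service id) -> required for external events.
    external_auth_id = Column(String(250), nullable=True)
    # Where the change came from, e.g. "api" plus the client IP / remote_addr -> required.
    external_source = Column(String(250), nullable=True)
    # When the change actually happened in the external system -> nullable.
    external_event_timestamp = Column(UtcDateTime, nullable=True)
```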
What do you think of the latest changes @bolkedebruin?
So, I like where this is going, but I'd like some extra robustness / proper security (see above). Furthermore, we need to think about how this API will be used. For example, I expect the majority of usage to come from cloud storage integration. S3 (+MinIO), GCS and ABS all use their own callback schema, and ideally we would allow providers to register these kinds of callbacks. The question becomes how to 'detect' which service we are integrating with, without creating a lot of work for ourselves by needing to expose every flavor of callback as a separate API. I quite understand that this is beyond the scope of your PR, but it gives a dot on the horizon, so to say.

I think with the security concerns addressed and unit tests added it looks mergeable. I'm a bit concerned about the schema and schema evolution. How's that going to work?
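To make that "dot on the horizon" concrete, here is a purely illustrative sketch of how providers might register a translation from their native notification payload to a dataset URI; none of these names exist in Airflow, and the S3 payload layout is just the standard S3 event notification format:

```python
# Illustrative only: a hypothetical registry that lets providers translate
# their storage-notification payloads into (dataset_uri, extra) pairs, so a
# single endpoint could dispatch on the declared service.
from __future__ import annotations

from typing import Callable

_CALLBACK_PARSERS: dict[str, Callable[[dict], tuple[str, dict]]] = {}


def register_callback_parser(service: str, parser: Callable[[dict], tuple[str, dict]]) -> None:
    _CALLBACK_PARSERS[service] = parser


def parse_s3_notification(payload: dict) -> tuple[str, dict]:
    # Standard S3 event notifications put bucket/key under Records[0].s3.
    record = payload["Records"][0]
    uri = f"s3://{record['s3']['bucket']['name']}/{record['s3']['object']['key']}"
    return uri, {"event_name": record.get("eventName")}


register_callback_parser("s3", parse_s3_notification)
```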
LGTM once @bolkedebruin's comments are addressed
This feature looks like something we can use. I had a quick look and one thing came to mind: it looks like it is not possible to create a remote dataset event if the dataset does not already exist. Is that correct? In my opinion, it would then also be nice if we could create a dataset by using the API.
from sqlalchemy import func
from sqlalchemy.orm import Session, joinedload, subqueryload

from airflow import Dataset
This should import from `airflow.datasets` instead.
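That is:

```python
# Import Dataset from the dedicated datasets module rather than the top-level package.
from airflow.datasets import Dataset
```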
raise BadRequest(detail=str(err))
uri = json_body["dataset_uri"]
external_source = request.remote_addr
user_id = getattr(current_user, "id", None)
Instead of the ID, I feel this should use the username. The ID from the database should be considered kind of an implementation detail.
isn't the id how a typical FK->PK relationship is defined?
it seems appropriate to add user_id and make a FK to users table. then one could always look up the username by joining?
Yeah, good point, this should probably just be an fk.
Edit: Using an fk has problems when a user is deleted though. We probably don't want to lose the triggering history in that case.
true
i think there's a mechanism to set to null when user deleted?
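(For reference, a minimal SQLAlchemy sketch of that mechanism, assuming the FK targets the ab_user table used by the FAB user model:)

```python
# Sketch: a nullable user FK that is set to NULL when the referenced user is deleted.
from sqlalchemy import Column, ForeignKey, Integer

created_by_user_id = Column(
    Integer,
    ForeignKey("ab_user.id", ondelete="SET NULL"),
    nullable=True,
)
```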
Setting this to a (non-fk) username would be much more useful than having a null IMO.
I don't stand in the way here, don't really mind, but if you'll humor me I'll think it through out loud...
Maybe i'm just hung up on the standard practice of using surrogate keys, normalization, etc but...
So, airflow provides a mechanism to deactivate users. Interestingly, on the user edit form, it even says don't delete users; it's a bad practice; just deactivate them.
Additionally, a username can be changed. So I could take some action, change my username, and now you don't know that I took that action.
Additionally, you could delete a user and have a new user added with the same username, but it'd be a different "user".
I think your point that having a username is more useful than having a null is obviously true. But I guess my thought is that it's not a situation that should happen, because users should not be deleted (by keeping them we don't run into these problems), and if an airflow cluster admin does that, well, that's up to them. But presuming they don't, you have the benefits of referential integrity.
Interestingly, the log table does not have a user_id column, which is a bit weird... probably should... but then there too i'd say it would make sense to record the db ID.
Another option would be to reject the deletion of a user when there are associated log or dataset events. That would seem reasonable too.
So yeah, I think i've convinced myself a bit more that using the ID is the right way. I think that the mutability of username is a strong argument in light of security / auditing concerns. But lmkwyt.
I guess the other reason @uranusjr would be so that we can use standard ORM features such as relationships to get from user to dataset and vice versa but... i suppose you could say that you could still do so with username via custom join conditions 🤷
brought this up on slack. the suggestion is to not add a user column at all right now since user handling will change with AIP-56. and for auditing purposes you can add a record to the Log table.
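A minimal sketch of that audit-log approach, assuming the existing Log model; the event name and extra payload are made up for illustration:

```python
# Sketch: record an API-triggered dataset change in the audit log instead of
# adding a user column to DatasetEvent. Event name and extra string are illustrative.
from airflow.models.log import Log


def audit_external_dataset_event(session, uri, username, remote_addr):
    session.add(
        Log(
            event="dataset_event_from_api",  # hypothetical event name
            owner=username,
            extra=f"uri={uri}, remote_addr={remote_addr}",
        )
    )
```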
"timestamp": self.default_time, | ||
} | ||
|
||
def test_should_raises_401_unauthenticated(self, session): |
- def test_should_raises_401_unauthenticated(self, session):
+ def test_should_raise_401_unauthenticated(self, session):
Yeah it's a good idea @DjVinnii, perhaps you'd be interested in contributing that? Although, if the dataset does not exist, then nothing on the cluster is using it, so creating an event for it would not really have any effect.
session=session,
)

if dataset_event:
i wish it threw a helpful exception if there is no dataset instead of using None to signal that. though i don't think there's anything we can do about that now, can we @uranusjr ?
oh wait... this is a new method. so we could. but wait do we even need a new method?
@michaelmicheal why do we need a new method for this? could we not add params to register_dataset_change?
since the params are kwargs-only, i reckon we could make task_instance optional.
and, thankfully, since it accepts **kwargs, adding more params at the call site in airflow won't break anything for "old" custom dataset managers
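Roughly, that relaxed signature could look like the sketch below; this is not the actual diff in this PR, just an illustration in which the source task/dag columns are filled only when a task instance is provided:

```python
# Sketch: task_instance made optional so the same code path serves both
# task-triggered and API-triggered dataset changes.
from airflow.models.dataset import DatasetEvent, DatasetModel


class SketchDatasetManager:
    def register_dataset_change(self, *, task_instance=None, dataset, extra=None, session, **kwargs):
        dataset_model = session.query(DatasetModel).filter(DatasetModel.uri == dataset.uri).one_or_none()
        if not dataset_model:
            return None  # dataset not registered anywhere; nothing to record

        event_kwargs = {"dataset_id": dataset_model.id, "extra": extra}
        if task_instance is not None:
            # Only task-triggered changes carry the source task/dag/run information.
            event_kwargs.update(
                source_task_id=task_instance.task_id,
                source_dag_id=task_instance.dag_id,
                source_run_id=task_instance.run_id,
                source_map_index=task_instance.map_index,
            )
        dataset_event = DatasetEvent(**event_kwargs)
        session.add(dataset_event)
        session.flush()
        return dataset_event
```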
except ValidationError as err:
    raise BadRequest(detail=str(err))
uri = json_body["dataset_uri"]
external_source = request.remote_addr
i am not sure that remote_addr is the right choice here. maybe we could record such information in the log table? but to me it would seem it might be more useful to let this be an arbitrary text field? although i suppose the user can always supply information in the extra dict.... wdyt?
Our use case is to be able to synchronize datasets between multiple Airflow instances, so that consumers only have to know the dataset name and don't have to be aware of which instance the producer dag is in. At the moment we are creating and updating datasets across Airflow instances by using a hacky dummy dag that produces the dataset in the other instances, but an API seems way more robust. I'm willing to give this a try and contribute.
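For that cross-instance use case, a producing instance could call the new endpoint along these lines; the endpoint path, payload keys and auth below are assumptions based on this PR's draft, not the merged API:

```python
# Illustrative client call to the proposed dataset-event endpoint.
# URL, credentials and payload shape are placeholders and may not match the final API.
import requests

response = requests.post(
    "https://other-airflow.example.com/api/v1/datasets/events",
    json={
        "dataset_uri": "s3://bucket/reporting/daily.parquet",
        "extra": {"producer": "airflow-instance-a"},
    },
    auth=("api_user", "api_password"),
)
response.raise_for_status()
```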
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.
Is this feature going to be merged for 2.7?
This is not finished and cannot be merged. If you are interested in the feature, please open a new pull request and work on it.
Closes: #29162
To support integration with external services that logically update real-world datasets, this PR adds an API endpoint for registering dataset update events from outside Airflow.
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named `{pr_number}.significant.rst` or `{issue_number}.significant.rst`, in newsfragments.