If a Dataset symlink is created afterwards with a DatasetEvent, the link is not created in the lineage #2738

Open
dkt-sophie-ly opened this issue Jan 26, 2024 · 3 comments

@dkt-sophie-ly

dkt-sophie-ly commented Jan 26, 2024

If I send 2 run events that create 2 separate lineage graphs, like the following:

ns1:input1 ----- job1 -----> ns2:output1

and

ns1:input2 ------ job2 -----> ns2:output2

Then I send a DatasetEvent that creates a symlink, specifying that input1 and input2 are in fact the same dataset:

{
  "eventTime": "2023-07-18T17:20:00",
  "dataset": {
    "namespace": "ns1",
    "name": "input1",
    "facets": {
      "symlinks": {
        "identifiers": [
          {
            "namespace": "ns1",
            "name": "input2",
            "type": "DB_TABLE"
          }
        ]
      }
    }
  }
}

I expected the 2 lineage graphs to merge into one, like the following:

ns1:input1 ----- job1 -----> ns2:output1
|
|--------------- job2 ------> ns2:output2

But currently the two lineage graphs are not merged and stay separate.
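
To make the expected behavior concrete, here is a minimal in-memory sketch in Python. It is not Marquez code: `AliasResolver`, `apply_dataset_event`, and `build_lineage` are hypothetical names, and the sketch assumes symlinked identifiers should simply resolve to one canonical dataset when the graph is built.

```python
from collections import defaultdict


class AliasResolver:
    """Union-find over (namespace, name) dataset identifiers (hypothetical helper)."""

    def __init__(self):
        self.parent = {}

    def find(self, ds):
        self.parent.setdefault(ds, ds)
        while self.parent[ds] != ds:
            self.parent[ds] = self.parent[self.parent[ds]]  # path halving
            ds = self.parent[ds]
        return ds

    def link(self, primary, alias):
        # Keep the primary identifier as the canonical one.
        self.parent[self.find(alias)] = self.find(primary)


def apply_dataset_event(resolver, event):
    """Register every symlink identifier as an alias of the event's dataset."""
    dataset = event["dataset"]
    primary = (dataset["namespace"], dataset["name"])
    symlinks = dataset.get("facets", {}).get("symlinks", {})
    for ident in symlinks.get("identifiers", []):
        resolver.link(primary, (ident["namespace"], ident["name"]))


def build_lineage(runs, resolver):
    """Map each canonical input dataset to the (job, output) pairs it feeds."""
    lineage = defaultdict(list)
    for inp, job, out in runs:
        lineage[resolver.find(inp)].append((job, resolver.find(out)))
    return dict(lineage)


# The two runs and the DatasetEvent from this issue.
runs = [
    (("ns1", "input1"), "job1", ("ns2", "output1")),
    (("ns1", "input2"), "job2", ("ns2", "output2")),
]
event = {
    "eventTime": "2023-07-18T17:20:00",
    "dataset": {
        "namespace": "ns1",
        "name": "input1",
        "facets": {
            "symlinks": {
                "identifiers": [
                    {"namespace": "ns1", "name": "input2", "type": "DB_TABLE"}
                ]
            }
        },
    },
}

separate = build_lineage(runs, AliasResolver())  # before the symlink event
resolver = AliasResolver()
apply_dataset_event(resolver, event)
merged = build_lineage(runs, resolver)  # after the symlink event
```

In this model, `separate` has two input nodes, while `merged` has a single `("ns1", "input1")` node feeding both job1 and job2, which is the graph I expected to see.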

@wslulciuc
Member

@dkt-sophie-ly, with PR #2641, we should expect the lineage graph to begin using the symlink in the DatasetEvent. But, after looking into how the lineage graph is built:

  1. Call LineageDao.getLineage() to get the job node data
  2. Then call LineageDao.getDatasetData() to get the dataset node data

I don't think we've invested heavily in building out symlink support for our lineage graph. @pawel-big-lebowski let me know if that's not the case.

@dkt-sophie-ly
Author

dkt-sophie-ly commented Feb 7, 2024

Hi @wslulciuc! Thanks for your reply :)

With PR #2736, the lineage graph should be able to show the lineage of a symlinked dataset, but only if the symlink is set up beforehand, like this:

{
  "input1": {
    "symlinks": {
      "identifiers": [{"namespace": "ns2", "name": "input2"}]
    }
  }
}

Here input1 and input2 can be linked together because they have the same dataset uuid in datasets_view.

If the symlink is created afterwards (both datasets are created separately by 2 different runs, and a DatasetEvent then adds a symlink between the 2 lineage graphs), the graphs won't be linked, because the datasets already have different dataset uuids.

I don't know if it is possible, but it would be great if, when a symlink is created afterwards with a DatasetEvent, the dataset uuid changed accordingly (e.g. input2's dataset uuid is changed to match input1's).
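
A rough sketch of what I mean, using plain Python structures as stand-ins for the stored rows. The names `dataset_uuid`, `job_io`, `upsert_dataset`, `record_run`, and `add_symlink_afterwards` are all made up for illustration and do not correspond to actual Marquez tables or methods; the only assumption taken from above is that two names are linked when they share one dataset uuid.

```python
import uuid

# Hypothetical stand-ins for stored rows; names are illustrative only.
dataset_uuid = {}  # (namespace, name) -> dataset uuid
job_io = []        # (job name, input dataset uuid, output dataset uuid)


def upsert_dataset(namespace, name):
    """Return the uuid for a dataset name, creating one on first sight."""
    return dataset_uuid.setdefault((namespace, name), uuid.uuid4())


def record_run(job, inp, out):
    """Store one run's input and output by dataset uuid."""
    job_io.append((job, upsert_dataset(*inp), upsert_dataset(*out)))


def add_symlink_afterwards(primary, alias):
    """Re-point the alias, and every row using its old uuid, to the primary's uuid."""
    keep = upsert_dataset(*primary)
    old = dataset_uuid.get(alias)
    dataset_uuid[alias] = keep
    if old is not None and old != keep:
        # Rewrite job inputs/outputs that still reference the old uuid.
        job_io[:] = [
            (job, keep if i == old else i, keep if o == old else o)
            for job, i, o in job_io
        ]
        # Any other name that shared the old uuid follows along.
        for key, value in dataset_uuid.items():
            if value == old:
                dataset_uuid[key] = keep


# Both datasets are created separately by 2 different runs...
record_run("job1", ("ns1", "input1"), ("ns2", "output1"))
record_run("job2", ("ns1", "input2"), ("ns2", "output2"))
inputs_before = {i for _, i, _ in job_io}

# ...and the symlink arrives later through a DatasetEvent.
add_symlink_afterwards(("ns1", "input1"), ("ns1", "input2"))
inputs_after = {i for _, i, _ in job_io}
```

Before the symlink, `inputs_before` holds two distinct uuids; afterwards `inputs_after` holds one, so a uuid-based lineage query would see job1 and job2 reading the same dataset. A real implementation would also need to decide what happens to versions and facets attached to the old uuid, which this sketch ignores.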

@dkt-sophie-ly
Author

Hi @wslulciuc, just a kind reminder on this issue :)

@wslulciuc wslulciuc added this to the 0.52.0 milestone Oct 23, 2024
2 participants