How different execution events are related between them #1260

petergomez97 · 2023-08-24T09:40:33Z

In our company we need data lineage not only within one process but across different executions. That is, if a process in one execution writes an output file named "x", and a process in another execution then reads "x" as an input source, I want the two to be related. I have seen that one execution was linked to a different process whose output is its input source; for the rest of the processes, however, this has not happened.

Therefore, my question is: how do different executions get related? By the data source name?
Additionally, how does a data source node get named? I have seen that when one execution event writes a file called "x" with a given URL, and the same file is then used as an input data source in a different process, the data source node gets a different name. This might be why they are not linked in the data lineage graph.

petergomez97 (Author) · 2023-08-24T09:56:19Z

With the last question I mean the following. For example, for a process that writes several output files, and therefore has several executions, the target or output file is IL_CLE_2_1_2, with a URL of file:/fastdisk3/flight_searches/2_1_2/IL_CLE_2_1_2
This is the picture

Inside the processing node this is the node where it is written

Maybe the partitions are causing the change in data source name?

Then, in another execution that uses that same output data source as its input and has the same URL, the data source name is different (it is named after the last partition).

Any reason for this? I want to have the complete view in the lineage diagram of both executions.

Thanks in advance

cerveada (Maintainer) · 2023-08-24T09:59:02Z

> how do different executions get related? By the data source name?

By the data source URI. Its format can differ for each data source type, but in the end it is a String in the database.
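To make the consequence of string-based matching concrete, here is a minimal sketch. It is not Spline's actual code or data model: the record layout, the `link_executions` helper, the `file:/tmp/...` targets and the `part=7` suffix are all made up for illustration. It only shows that linking on exact URI equality connects a reader to a writer when the strings are identical and misses it when the reader's URI carries anything extra.

```python
from collections import defaultdict

# Hypothetical, simplified execution records (NOT Spline's real data model).
# The "part=7" suffix and the file:/tmp/... targets are invented for this example.
BASE = "file:/fastdisk3/flight_searches/2_1_2/IL_CLE_2_1_2"
executions = [
    {"id": "exec-1", "reads": [], "writes": BASE},
    {"id": "exec-2", "reads": [BASE], "writes": "file:/tmp/report"},
    {"id": "exec-3", "reads": [BASE + "/part=7"], "writes": "file:/tmp/other"},
]


def link_executions(executions):
    """Return (writer_id, reader_id) pairs whose URIs are equal as strings."""
    writers = defaultdict(list)
    for execution in executions:
        writers[execution["writes"]].append(execution["id"])

    links = []
    for execution in executions:
        for uri in execution["reads"]:
            # Plain string lookup: no normalization of any kind.
            for writer_id in writers.get(uri, []):
                links.append((writer_id, execution["id"]))
    return links


links = link_executions(executions)
print(links)  # [('exec-1', 'exec-2')]
```

In this toy model `exec-3` reads data under the same directory, yet it is not linked to `exec-1` because its URI string differs.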

> how does a data source node get named?

There are multiple plugins for different data source types in the Spark Spline Agent. Each plugin is responsible for extracting the URIs from its data sources.

> Maybe the partitions are causing the change in data source name?

Yes, this is a known issue.

Generally, this is an unsolvable problem. Consider an OS path where one file has multiple aliases, or a server that is reachable at different IP addresses from different networks.
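The OS-path case can be reproduced in a few lines. The sketch below is only an illustration and assumes a POSIX-like system where symlinks can be created; the file name `latest` and the way the `file:` URIs are built are assumptions of the example, not something the agent does. It creates one file with two paths and shows that the two URI strings differ even though the operating system reports a single file.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    real_path = os.path.join(tmp, "IL_CLE_2_1_2")
    alias_path = os.path.join(tmp, "latest")  # hypothetical alias name

    with open(real_path, "w") as handle:
        handle.write("data")
    os.symlink(real_path, alias_path)  # second name for the same file

    # Two URI strings built from the two paths (illustrative construction).
    uri_a = "file:" + real_path
    uri_b = "file:" + alias_path

    same_string = uri_a == uri_b                       # what string matching sees
    same_file = os.path.samefile(real_path, alias_path)  # what the OS knows

print(same_string, same_file)  # False True
```

A lineage tool that only sees the strings has no way to discover that equivalence on its own, which is why the general problem cannot be solved from URIs alone.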

We want to solve this eventually by allowing the Spline admin to define which URIs should be considered the same data source, but work on this hasn't even begun.

See the issue here: #689
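Since that feature does not exist yet, the following is purely a hypothetical sketch of what admin-defined equivalence rules could look like if applied outside Spline, for example when post-processing lineage data. The `ALIAS_RULES` list, the `canonical_uri` helper and the `year=2023/month=08` partition layout are assumptions made for this example; the real partition naming in the case above is not shown in the thread.

```python
import re

# Hypothetical admin-defined rules: (pattern, replacement) pairs applied in order.
ALIAS_RULES = [
    # Assumption: partitions appear as trailing "key=value" directories.
    (re.compile(r"(/[^/=]+=[^/]+)+/?$"), ""),
    # Treat "file:///path" and "file:/path" as the same form.
    (re.compile(r"^file:///"), "file:/"),
]


def canonical_uri(uri: str) -> str:
    """Map a URI to the form under which equivalent sources compare equal."""
    for pattern, replacement in ALIAS_RULES:
        uri = pattern.sub(replacement, uri)
    return uri


written = "file:/fastdisk3/flight_searches/2_1_2/IL_CLE_2_1_2"
# Invented reader URI with a different scheme form and partition directories.
read = "file:///fastdisk3/flight_searches/2_1_2/IL_CLE_2_1_2/year=2023/month=08"

print(canonical_uri(read) == canonical_uri(written))  # True
```

Rules like these have to come from someone who knows the environment; no fixed set of them is correct for every deployment.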
