Optimised Query to fetch attribute level dependency for all target attributes for a given execution plan for 0.6.x Spline model #937

pratapmmmec · 2021-08-09T10:17:18Z

Background [Optional]

Spline 0.6.x can currently track backward attribute-level lineage for a target data source. In the Spline UI this takes two clicks: first on the target attribute, then on Details.

Question

@wajda
We are looking for an AQL query that fetches, in an optimal way, the attribute-level dependencies for all attributes of the target data source for a given eventID.

For example, we have two datasets, Employee and Department, going through the transformation below, where effectiveBonus is derived from Employee.bonus and Department.bonusMultiplier. What would be the optimized AQL for this case?

Employee:
empId
empName
deptId
bonus

Department:
deptId
deptName
bonusMultiplier

Code Logic:
Dataset empDS = // Read Employee
Dataset deptDS = // Read Department
Dataset bonus = empDS.join(deptDS, "deptId").withColumn("effectiveBonus", col("bonus").multiply(col("bonusMultiplier")));
bonus.write().save("Final"); // Final Table

Expected Result
The result will be pushed to the Meta Integration Tool [MITI] (http://www.metaintegration.net/). We are essentially building an adaptor from Spline to MITI.

"column_relationship":[
    {"input_entity_name":"Employee", "output_entity_name":"Final", "source_column_name": "empId", "target_column_name": "empId"},
    {"input_entity_name":"Employee", "output_entity_name":"Final", "source_column_name": "empName", "target_column_name": "empName"},
    {"input_entity_name":"Employee", "output_entity_name":"Final", "source_column_name": "deptId", "target_column_name": "deptId"},
    {"input_entity_name":"Employee", "output_entity_name":"Final", "source_column_name": "bonus", "target_column_name": "bonus"},
    {"input_entity_name":"Department", "output_entity_name":"Final", "source_column_name": "deptId", "target_column_name": "deptId"},
    {"input_entity_name":"Department", "output_entity_name":"Final", "source_column_name": "deptName", "target_column_name": "deptName"},
    {"input_entity_name":"Department", "output_entity_name":"Final", "source_column_name": "bonusMultiplier", "target_column_name": "bonusMultiplier"},
    {"input_entity_name":"Department", "output_entity_name":"Final", "source_column_name": "bonusMultiplier", "target_column_name": "effectiveBonus"},
    {"input_entity_name":"Employee", "output_entity_name":"Final", "source_column_name": "bonus", "target_column_name": "effectiveBonus"}
]

We are trying to write AQL across operations, attributes, etc. that produces the expected result, and will share it once ready. We just want to make sure it is optimized enough.

wajda (Maintainer) · 2021-08-09T11:52:31Z

I'm not sure I understand your question. Spline doesn't collect your data, it only collects metadata.
To query the lineage you need to operate with the Spline data model entities. See below.

Entity Definitions

Nodes

Progress

(top-level entity)
Describes the fact (and optionally the result) of execution of a given Execution Plan.
The Progress events are ordered according to their timestamp and represent (describe) concrete portions of data affected by the given run.
The Progress entity introduces a chronological axis to the data lineage. This is the only "dynamic" entity in the Spline data model. The remaining entities are static in the sense that they hold no meta-information that can be logically bound to a specific point in time.

Execution Plan

(top-level entity)
Represents a piece of a data transformation pipeline implemented as a standalone application, process, script, etc.
An Execution Plan consists of exactly one Write operation and any number of Read and Transform operations. (See Operation)

Data Source

(top-level entity)
This entity represents a uniquely named data location of any type where data is read from or written to. It could be a file, a table in a database, a Kafka topic, an FTP location, a REST endpoint, etc.
The uniqueness is achieved by using URI as a key.

Operation

An Operation is a building block of the data transformation pipelines. It represents a relational operation executed on sets of data rows, taking input sets S₁, S₂, etc. and producing an output set S'.
Some operations change the data structure (e.g. SQL SELECT), others only affect the number or order of rows (e.g. FILTER or SORT).
Certain operations are binary (JOIN or UNION), the others are unary. Some operations (terminal operations) have no input data sets (e.g. Generate). The Read and Write operations are special in that they represent I/O side effects and are always terminal.
Operations form a directed acyclic graph (DAG) with a single Write operation completing the chain.
The Operation is not a top-level entity and is bound to the Execution Plan via the composite-component relation.

Schema

In the Spline model, a Schema is the structure of a data set that an Operation deals with. Every Operation emits a Schema.
The Schema describes the Attributes and their order. Schemas are logically related to the scope of a given Execution Plan, but can be shared by different Operations (those that don't change the data structure) that belong to the same Execution Plan.

Attribute

Represents a single attribute of the data relation (e.g. a column in a table). It is characterized by a name and optionally a data type.
An Attribute belongs to one or more Schemas in the scope of the same Execution Plan. Being a part of a Schema, the Attribute is logically bound to the Operation that created it (called the operation of origin). Every Attribute has exactly one operation of origin. Two attributes with the same name and type created by different operations are considered different. An attribute can derive from other attributes (e.g. SELECT a+b AS c), and optionally contain a reference to an Expression that describes how exactly it was calculated. (See Expression)

Expression

This is the finest level of abstraction in the Spline data lineage model. An Expression represents a mathematical expression that was used to calculate data for a given Attribute. Expressions form a DAG structure similar to the Operations DAG, but operate on the attribute level rather than the data relation level.
An Expression is described by its name (and optionally a data type), can have any arity, and can refer to other Expressions or Attributes as operands.

Edges

Progress Of

Connects Progress to the Execution Plan

Depends / Affects

Connects Execution Plan to the Data Source

Reads / Writes

Connects Read or Write operation respectively to the Data Source

Executes

Connects Execution Plan to the Write operation, that is the root of the given operation DAG.

Follows

Connects Operations together in a DAG

Emits

A relationship between an Operation and its output Schema

Consists Of

A composite relationship between Schema and Attribute

Uses

A connection between an Operation and the Expressions (or Attributes) that this operation refers to as parameters (e.g. the connection between a FILTER operation and its predicate)

Produces

Connects an Attribute to its Operation of origin.

Takes

A connection between an Expression and its operands (other Expressions or Attributes)

Derives From

Models a dependency between Attributes

Computed By

Links an Attribute with an Expression that was used to compute values for the given Attribute
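To make the definitions above more concrete, here is a minimal Python sketch of the Employee/Department example as an in-memory graph. It is only a toy model: the ids, the dictionary layout, the edge directions, the "Transformation" type label, and the helper names `origin_of` and `writes` are assumptions made for illustration and do not describe how Spline actually stores its collections.

```python
# Toy model of one execution plan. All ids and names are hypothetical.
operations = {
    "op:write": {"type": "Write", "name": "Final"},
    "op:project": {"type": "Transformation", "name": "Project"},
    "op:join": {"type": "Transformation", "name": "Join"},
    "op:readEmp": {"type": "Read", "name": "Employee"},
    "op:readDept": {"type": "Read", "name": "Department"},
}

# "follows" edges, assumed direction: (operation, operation whose output it consumes)
follows = [
    ("op:write", "op:project"),
    ("op:project", "op:join"),
    ("op:join", "op:readEmp"),
    ("op:join", "op:readDept"),
]

# "produces" edges: (operation of origin, attribute)
produces = [
    ("op:readEmp", "bonus"),
    ("op:readDept", "bonusMultiplier"),
    ("op:project", "effectiveBonus"),
]

# "derivesFrom" edges: (attribute, attribute it depends on)
derives_from = [
    ("effectiveBonus", "bonus"),
    ("effectiveBonus", "bonusMultiplier"),
]


def origin_of(attribute):
    """Return the id of the single operation that produced the attribute."""
    origins = [op for op, attr in produces if attr == attribute]
    assert len(origins) == 1, "every attribute has exactly one operation of origin"
    return origins[0]


def writes(ops):
    """Return the ids of all Write operations; a valid plan has exactly one."""
    return [op_id for op_id, op in ops.items() if op["type"] == "Write"]


print(origin_of("effectiveBonus"), writes(operations))
```

In this toy plan, effectiveBonus originates from the projection, while bonus and bonusMultiplier originate from the two Read operations, which is the distinction a lineage query relies on.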

pratapmmmec (Author) · 2021-08-14T10:05:25Z

@wajda I have also updated the question.

I am only looking for metadata: the column-level lineage of all attributes in the final table.

How can I backtrack the lineage of all attributes in the final node (Write type) to the source tables (Read type) using AQL? We don't need information about intermediate nodes such as joins and projections.

Back to my example of Employee (Read type), Department (Read type) and Final (Write type).
The effectiveBonus column in Final is derived from bonus (Employee) and bonusMultiplier (Department); all other attributes come as-is from the two Read sources. I am looking for an optimized way to query the Spline data model to fetch lineage like the following:

"column_relationship":[
    {"input_entity_name":"Employee", "output_entity_name":"Final", "source_column_name": "empId", "target_column_name": "empId"},
    {"input_entity_name":"Employee", "output_entity_name":"Final", "source_column_name": "empName", "target_column_name": "empName"},
    {"input_entity_name":"Employee", "output_entity_name":"Final", "source_column_name": "deptId", "target_column_name": "deptId"},
    {"input_entity_name":"Employee", "output_entity_name":"Final", "source_column_name": "bonus", "target_column_name": "bonus"},
    {"input_entity_name":"Department", "output_entity_name":"Final", "source_column_name": "deptId", "target_column_name": "deptId"},
    {"input_entity_name":"Department", "output_entity_name":"Final", "source_column_name": "deptName", "target_column_name": "deptName"},
    {"input_entity_name":"Department", "output_entity_name":"Final", "source_column_name": "bonusMultiplier", "target_column_name": "bonusMultiplier"},
    {"input_entity_name":"Department", "output_entity_name":"Final", "source_column_name": "bonusMultiplier", "target_column_name": "effectiveBonus"},
    {"input_entity_name":"Employee", "output_entity_name":"Final", "source_column_name": "bonus", "target_column_name": "effectiveBonus"}
]

We are trying to write AQL across operations, attributes, etc. that produces the expected result, and will share it once ready. We just want to make sure it is optimized enough.

pratapmmmec (Author) · 2021-08-18T05:33:14Z

@wajda I am using the logic below to get the desired result.

1. Get the details (URI, table name, attribute names) of the Read and Write operations for a particular event:

FOR op IN operation
    FILTER op._belongsTo == 'executionPlan/65fc7a35-6f47-4a59-be3a-9a96ed96f7e4' AND op.type IN ['Read', 'Write']
    FOR e IN emits
        FILTER e._belongsTo == 'executionPlan/65fc7a35-6f47-4a59-be3a-9a96ed96f7e4' AND e._from == op._id
        FOR c IN consistsOf
            FILTER c._belongsTo == 'executionPlan/65fc7a35-6f47-4a59-be3a-9a96ed96f7e4' AND c._from == e._to
            FOR a IN attribute
                FILTER a._belongsTo == 'executionPlan/65fc7a35-6f47-4a59-be3a-9a96ed96f7e4' AND a._id == c._to
                RETURN MERGE(op, e, c, a)

2. Use the attributes (from the Write operation only) returned by the query above to fetch the backward lineage for each of them:

LET att = "attribute/65fc7a35-6f47-4a59-be3a-9a96ed96f7e4:64"

FOR df IN derivesFrom
    FILTER df._from == att
    FOR a IN attribute
        FILTER df._to == a._id
        FOR p IN produces
            FILTER df._to == p._to
            FOR op IN operation
                FILTER p._from == op._id
                RETURN {
                    "att": att,
                    "aName": a.name,
                    "aId": a._id,
                    "opExtra": op.extra,
                    "opId": op._id,
                    "opSource": op.inputSources,
                    "opType": op.type
                }

1 reply

wajda Aug 18, 2021
Maintainer

Regarding no. 1:

You basically join Read/Write operations in a given execution plan with their related attributes. I'm not 100% sure, but I assume you're going to group on the operation later at some point to reassemble a 1-n relation.
With that assumption in mind and in general there are a few things I would like to mention:

The filter condition _belongsTo == 'executionPlan/65fc7a35-6f47-4a59-be3a-9a96ed96f7e4' is redundant in every statement except the first one.
The edges carry no information that is useful for the final result, except for the ordering info on some of them (e.g. consistsOf.index), so merging e and c can be omitted.
Unless the attribute order is irrelevant for your needs, you need an additional SORT statement on consistsOf.index
A graph traversal will make the query both shorter and faster.

Consider this:

FOR op IN operation
    FILTER op._belongsTo == 'executionPlan/a3872ecd-d49e-4ea9-926e-6f796a08e26b'
       AND op.type IN ['Read', 'Write'] 
    FOR attr, e IN 2 OUTBOUND op emits, consistsOf 
        SORT e.index
        RETURN {op, attr}

or grouped:

FOR op IN operation
    FILTER op._belongsTo == 'executionPlan/a3872ecd-d49e-4ea9-926e-6f796a08e26b'
       AND op.type IN ['Read', 'Write'] 
    RETURN {
        op,
        attrs: (FOR attr, e IN 2 OUTBOUND op emits, consistsOf SORT e.index RETURN attr)
    }

Regarding no. 2:

That query only gives you a single step in the attribute dependency graph. If you want to see which attributes of which Read operations a given attribute X of a Write operation depends on, you need a graph traversal over the derivesFrom edges:

LET att = "attribute/a3872ecd-d49e-4ea9-926e-6f796a08e26b:947"

FOR a, e, p IN 1..999999 OUTBOUND att derivesFrom
    LET op = FIRST(FOR op IN 1 INBOUND a produces RETURN op)
    FILTER op.type == "Read"
    COLLECT aId=a._id INTO g
    LET a = g[0].a
    LET op = g[0].op
    RETURN {
        "att": att,
        "aName": a.name, 
        "aId": a._id,
        "opExtra": op.extra,
        "opId": op._id,
        "opSource": op.inputSources,
        "opType": op.type
    }

Hope it helps.
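For readers less familiar with AQL traversals, the logic of the query above can be mimicked in plain Python: follow the derivesFrom edges transitively from the starting attribute, keep only attributes whose operation of origin is a Read, and report each of them once. The sample edges, the ids, the "Transformation" label, and the function name `read_sources` are hypothetical; this only illustrates the idea and is not Spline code.

```python
from collections import deque

# Hypothetical derivesFrom edges: attribute id -> ids it directly derives from.
DERIVES_FROM = {
    "attr:effectiveBonus": ["attr:product"],
    "attr:product": ["attr:bonus", "attr:bonusMultiplier"],
}

# Hypothetical inverted produces edges: attribute id -> type of its operation of origin.
ORIGIN_TYPE = {
    "attr:effectiveBonus": "Transformation",
    "attr:product": "Transformation",
    "attr:bonus": "Read",
    "attr:bonusMultiplier": "Read",
}


def read_sources(start):
    """Return ids of Read-origin attributes reachable from `start` via derivesFrom.

    The start attribute itself is never returned, mirroring a traversal
    whose minimum depth is 1.
    """
    found = []
    seen = {start}
    queue = deque(DERIVES_FROM.get(start, []))
    while queue:
        attr = queue.popleft()
        if attr in seen:
            continue  # report each attribute once, like the COLLECT step
        seen.add(attr)
        if ORIGIN_TYPE.get(attr) == "Read":
            found.append(attr)
        queue.extend(DERIVES_FROM.get(attr, []))
    return found


print(read_sources("attr:effectiveBonus"))
```

With the single-step query, only attr:product would be found for effectiveBonus in this sample; the transitive walk continues through it to the two Read attributes.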

abhishekshenoy · 2021-08-19T05:47:36Z

Grouping the attributes by operation makes the result more meaningful. We are working on a way to combine queries 1 and 2, so that we retrieve the attributes grouped by operation and, for the attributes of the Write operation, the information about which attributes they derive from.
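One possible way to do the combination outside the database is to post-process the two query results in the adaptor. The Python sketch below assumes hypothetical input shapes: `write_attrs` is the list of attribute names of the Write operation, and `sources_by_attr` maps each of those names to the Read attributes found by the traversal, each tagged with the name of its Read data source. Only the output field names come from the column_relationship example in the question; the function name and the input layout are invented for illustration.

```python
def column_relationships(output_entity, write_attrs, sources_by_attr):
    """Flatten per-attribute lineage into column_relationship records."""
    rows = []
    for target in write_attrs:
        for source in sources_by_attr.get(target, []):
            rows.append({
                "input_entity_name": source["entity"],
                "output_entity_name": output_entity,
                "source_column_name": source["column"],
                "target_column_name": target,
            })
    return rows


# Hypothetical sample input covering two of the target columns.
rows = column_relationships(
    "Final",
    ["empName", "effectiveBonus"],
    {
        "empName": [{"entity": "Employee", "column": "empName"}],
        "effectiveBonus": [
            {"entity": "Employee", "column": "bonus"},
            {"entity": "Department", "column": "bonusMultiplier"},
        ],
    },
)
for row in rows:
    print(row)
```

One thing to verify against real data: if a column that passes through unchanged is represented by the same attribute in the Read and Write schemas, a traversal with a minimum depth of 1 may return nothing for it, and the adaptor would then need to map such an attribute to itself.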
