Background
In my production environment it is hard to add a listener to the Spark shell, so I thought it might work to get the execution plans of Spark jobs from the history server instead. Is it possible to use a downloaded execution plan in place of the Spark job listener?
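For reference, the documented way to attach the agent without changing job code is through Spark configuration when launching the shell. A minimal sketch, assuming the codeless-init bundle (the artifact coordinates, version, and producer URL below are illustrative and must match your Spark/Scala versions and Spline deployment):

spark-shell \
  --packages za.co.absa.spline.agent.spark:spark-3.3-spline-agent-bundle_2.12:2.0.0 \
  --conf spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener \
  --conf spark.spline.producer.url=http://localhost:8080/producer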
Thank you!
Feature
Add a feature that reads a Spark execution plan file, extracts the lineage, and sends the data to the Spline producer.
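A minimal sketch of what such a reader could look like, assuming the plan is available as a text file and is posted to a Spline producer REST endpoint. The endpoint path, media type, and the parsePlanToSplineJson helper are hypothetical placeholders, not an existing Spline API; the parsing step is exactly the hard part this feature request asks for:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.nio.file.{Files, Paths}

object PlanFileLineage {
  def main(args: Array[String]): Unit = {
    // Read a logical-plan dump downloaded from the history server.
    val planText = Files.readString(Paths.get(args(0)))

    // Hypothetical step: turn the plan text into a Spline execution-plan JSON.
    // A real implementation would have to parse operators such as
    // InsertIntoHadoopFsRelationCommand and HiveTableRelation into lineage nodes.
    val executionPlanJson = parsePlanToSplineJson(planText)

    // Send the result to the Spline producer (URL, path, and media type are
    // placeholders; adjust them to your deployment).
    val request = HttpRequest.newBuilder()
      .uri(URI.create("http://localhost:8080/producer/execution-plans"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(executionPlanJson))
      .build()
    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())
    println(s"Producer responded: ${response.statusCode()}")
  }

  // Placeholder for the actual plan-text-to-lineage conversion.
  def parsePlanToSplineJson(planText: String): String = ???
}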
Example [Optional]
Example logical plan:
InsertIntoHadoopFsRelationCommand s3://example/dw/ods/ods_example_table, [dt=2023-04-18], false, [dt#62], Parquet, [field.delim=\u0001, line.delim=\n, serialization.format=\u0001, partitionOverwriteMode=DYNAMIC, parquet.compression=SNAPPY, mergeSchema=false], Overwrite, CatalogTable(
Database: default
Table: ods_example_table
Owner: hdfs
Created Time: Wed Apr 19 16:03:44 CST 2023
Last Access: UNKNOWN
Created By: Spark 2.2 or prior
Type: EXTERNAL
Provider: hive
Table Properties: [bucketing_version=2, parquet.compression=SNAPPY, serialization.null.format=, transient_lastDdlTime=1681891424]
Location: s3://example/dw/ods/ods_example_table
Serde Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Storage Properties: [serialization.format=\u0001, line.delim=\n, field.delim=\u0001]
Partition Provider: Catalog
Partition Columns: [dt]
Schema: root
|-- seller_id: string (nullable = true)
|-- code: string (nullable = true)
|-- country_code: string (nullable = true)
|-- org_code: string (nullable = true)
|-- name_cn: string (nullable = true)
|-- name_en: string (nullable = true)
|-- api: string (nullable = true)
|-- create_time: string (nullable = true)
|-- creater: string (nullable = true)
|-- update_time: string (nullable = true)
|-- operator: string (nullable = true)
|-- status: string (nullable = true)
|-- org_code_add: string (nullable = true)
|-- user_name: string (nullable = true)
|-- is_online: string (nullable = true)
|-- dt: string (nullable = true)
), org.apache.spark.sql.execution.datasources.CatalogFileIndex@a03e7be2, [seller_id, code, country_code, org_code, name_cn, name_en, api, create_time, creater, update_time, operator, status, org_code_add, user_name, is_online, dt]
+- Project [seller_id#47, code#48, country_code#49, org_code#50, name_cn#51, name_en#52, api#53, create_time#54, creater#55, update_time#56, operator#57, status#58, org_code_add#59, user_name#60, is_online#61, cast(2023-04-18 as string) AS dt#62]
   +- Project [cast(seller_id#0 as string) AS seller_id#47, cast(code#1 as string) AS code#48, cast(country_code#2 as string) AS country_code#49, cast(org_code#3 as string) AS org_code#50, cast(name_cn#4 as string) AS name_cn#51, cast(name_en#5 as string) AS name_en#52, cast(api#6 as string) AS api#53, cast(create_time#7 as string) AS create_time#54, cast(creater#8 as string) AS creater#55, cast(update_time#9 as string) AS update_time#56, cast(operator#10 as string) AS operator#57, cast(status#11 as string) AS status#58, cast(org_code_add#12 as string) AS org_code_add#59, cast(user_name#13 as string) AS user_name#60, cast(is_online#14 as string) AS is_online#61]
      +- Project [seller_id#0, code#1, country_code#2, org_code#3, name_cn#4, name_en#5, api#6, create_time#7, creater#8, update_time#9, operator#10, status#11, org_code_add#12, user_name#13, is_online#14]
         +- SubqueryAlias spark_catalog.dbinit.ods_example_table
            +- HiveTableRelation [dbinit.ods_example_table, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [seller_id#0, code#1, country_code#2, org_code#3, name_cn#4, name_en#5, api#6, create_time#7, cre..., Partition Cols: []]
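For reference, a textual plan like the one above can also be printed directly from a session. A minimal spark-shell sketch (the table name is taken from the example above):

val qe = spark.sql("SELECT * FROM dbinit.ods_example_table").queryExecution
println(qe.optimizedPlan.toString)  // or qe.toString to print all plan stages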
Proposed Solution [Optional]
No, the Spark Agent is meant to be used as a listener; the execution-plan representation from the Spark history server isn't enough. Such functionality could, however, be implemented as a separate project, something like a Spline Agent for the Spark History Server.
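For anyone exploring that route: a plausible starting point would be the Spark event-log files that back the history server (JSON lines), where SQL execution start events carry the textual plan description. A rough sketch, assuming the standard event-log format (the event name matches Spark's SparkListenerSQLExecutionStart; treat the rest as illustrative):

import scala.io.Source

// Scan a Spark event log (JSON lines) for SQL execution start events,
// which contain the plan text shown in the history server's SQL tab.
object EventLogPlans {
  private val SqlStartEvent =
    "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart"

  def main(args: Array[String]): Unit = {
    val source = Source.fromFile(args(0))
    try {
      source.getLines()
        .filter(_.contains(SqlStartEvent))
        .foreach { line =>
          // A real implementation would parse the JSON and read the
          // "physicalPlanDescription" and "sparkPlanInfo" fields; here we
          // only surface the raw event for inspection.
          println(line.take(200) + " ...")
        }
    } finally source.close()
  }
}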