Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: complete task JSON serialization for other types (like data task, manifest task) #9597

Closed
stevenzwu opened this issue Jan 31, 2024 · 15 comments
Labels

Comments

@stevenzwu
Copy link
Contributor

stevenzwu commented Jan 31, 2024

Feature Request / Improvement

Right now, FileScanTaskParser JSON serializer only handles BaseFileScanTask for data files (see issue #1698 ). There are other FileScanTask impl classes that are not covered: StaticDataTask, AllManifestsTable$ManifestListReadTask, BaseEntriesTable$ManifestReadTask, BaseFilesTable$ManifestReadTask. This was discovered while I was trying to Flink FLIP-27 IcebergSource with metadata tables unit tests.

I propose that we add a type (or impl) filed to the JSON format that captures the FQCN of the FileScanTask implementation class. Fortunately, with JSON format, this can be a backward compatible change. if the type field is not present, the implementation class is defaulted to BaseFileScanTask.

We can also incrementally add the missing implementations. The next one to tackle should be the StaticDataTask.

cc @nastra @rdblue @aokolnychyi @pvary

Query engine

None

@stevenzwu
Copy link
Contributor Author

Regarding StaticDataTask, we can make this constructor package private.

  private StaticDataTask(
      InputFile metadata, Schema tableSchema, Schema projectedSchema, StructLike[] rows) {
       ...
  }

The main question is how to serialize the StructLike row. We can use Java serialization to convert the row to byte[]. We would need base64 encoding to convert the byte array to a string before adding it to JSON array. Base64 will increase the serialized bytes by 33%. Not sure if we have better option here.

@nastra
Copy link
Contributor

nastra commented Feb 1, 2024

I think we could make this similar to how it's done for a MetricsReport. We have a ReportMetricsRequestParser that reads a report-type field and then delegates to the respective parser for the particular type as can be seen here.

In this case we could have a task-type at the JSON level, which could be an enum in Java (similar to ReportType). I'm not sure using FQCN is necessarily good, because classes can be moved around and their names could change.

I haven't looked what challeges we'd have with serializing all subclasses of FileScanTask, but in terms of handling different task types the proposed approach could be used.

@stevenzwu
Copy link
Contributor Author

@nastra thanks for the comments!

Regarding the JSON format, we are on the same page of adding a new task-type or type field.

Are you suggesting adding a new API FileScanTask#type()? I can see the enum type works well for ReportMetricsRequest. But I don't know if is most natural with FileScanTask's existing class hierarchy. In my mind, class name (FQCN) is essentially the type.

image

Class renaming/moving around can be a problem although I don't know if practically we should do that. We can add a unit test to assert the class's FQCN didn't change. if renamed/relocated, the parser needs to be updated to track both old and new names.

what's your take on the problem of serializing the StrucktLike row?

@nastra
Copy link
Contributor

nastra commented Feb 2, 2024

Are you suggesting adding a new API FileScanTask#type()?

I was suggesting an enum type at the JSON level, not at the API level (similar to how it's done for ReportMetricsRequest).

Regarding StrucktLike serialization, I haven't had time yet to take a closer look at it unfortunately.

@stevenzwu
Copy link
Contributor Author

stevenzwu commented Feb 2, 2024

@nastra ReportMetricsRequest has the type at API level. That is probably where my confusions came from earlier.

  ReportType reportType();

If I understand you correctly, we can define the enum type inside the FileScanTaskParser. that would make sense to me. serializer can map the enum value to the implementation class.

@nastra
Copy link
Contributor

nastra commented Feb 2, 2024

@stevenzwu this is only because ReportMetricsRequest is a REST request class for a MetricsReport. So in the case of this issue here we'd define the enum type at the JSON level in the parser that would handle different task types and delegate to their respective JSON parsers

@stevenzwu
Copy link
Contributor Author

stevenzwu commented Feb 3, 2024

yeah. we are on the same page now. TaskType enum can be defined in FileScanTaskParser.

regarding the StructLike row serialization for StaticDataTask,

  // need to make this construct package private so that parser class can use the constructor directly
  private StaticDataTask(
      InputFile metadata, Schema tableSchema, Schema projectedSchema, StructLike[] rows) {
       ...
  }

I am thinking maybe we should implement a StructParser for JSON serialization of StructLike.

class StructParser {
    public static String toJson(StructLike struct, Schema schema);
    public static StructLike fromJson(String json, Schema schema);
}

@stevenzwu
Copy link
Contributor Author

stevenzwu commented Feb 8, 2024

@nastra @aokolnychyi any feedback on the proposal of adding a StructParser JSON serializer?

As for the InputFile, we can add a new constructor and serialize the DataFile using existing ContentFileParser.

StaticDataTask(DataFile metadata, Schema tableSchema, Schema projectedSchema, StructLike[] rows) 

@stevenzwu
Copy link
Contributor Author

Regarding the InputFile, we can serialize the DataFile using existing ContentFileParser

  private StaticDataTask(
      InputFile metadata, Schema tableSchema, Schema projectedSchema, StructLike[] rows) {
    this.tableSchema = tableSchema;
    this.projectedSchema = projectedSchema;
    this.metadataFile =
        DataFiles.builder(PartitionSpec.unpartitioned())
            .withInputFile(metadata)
            .withRecordCount(rows.length)
            .withFormat(FileFormat.METADATA)
            .build();
    this.rows = rows;
  }

We can add a new package private constructor

StaticDataTask(DataFile metadataFile, Schema tableSchema, Schema projectedSchema, StructLike[] rows)

@aokolnychyi
Copy link
Contributor

aokolnychyi commented Feb 13, 2024

I think it is reasonable to have a field in JSON that would indicate the task type. I'd also avoid any changes in task APIs, we can leverage that enum only in the parser. I doubt using FQCN is a good idea as it would make the implementation specific to Java.

Can we reuse SingleValueParser for serializing structs? I believe it is what we use today to serialize partitions.

@aokolnychyi
Copy link
Contributor

Given that the task serialization is part of the spec, shall we raise this discussion during the sync?

@rdblue
Copy link
Contributor

rdblue commented Feb 15, 2024

I think we should definitely use single-value serialization for the values in the structs when we convert to JSON. I probably wouldn't use objects, though. We could use a list and send values by position instead.

@stevenzwu
Copy link
Contributor Author

I think we should definitely use single-value serialization for the values in the structs when we convert to JSON. I probably wouldn't use objects, though. We could use a list and send values by position instead.

today SingleValueParser serializes a struct to a JSON object using field ID as JSON field name. the StructLike []rows would be serialized to a list of JSON object.

@rdblue are you suggesting a different behavior than the current SingleValueParser implementation for struct?

@stevenzwu stevenzwu changed the title Core: complete FileScanTaskParser for other FileScanTask implementation classes (like StaticDataTask) Core: complete task JSON serialization for other types (like data task, manifest task) Feb 16, 2024
Copy link

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

@github-actions github-actions bot added the stale label Oct 13, 2024
Copy link

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Oct 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants