Stop requiring users to import dataclasses_json or DataClassJSONMixin for dataclass #2279

Merged
pingsutw merged 30 commits into master from remove-dataclass_json
Apr 3, 2024

Conversation

Member

@Future-Outlier Future-Outlier commented Mar 18, 2024

Tracking issue

flyteorg/flyte#4486

Why are the changes needed?

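For a better user experience: users should be able to pass a plain @dataclass between tasks without importing dataclasses_json or DataClassJSONMixin.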

What changes were proposed in this pull request?

  1. Require mashumaro>=3.11 so that we can use JSONEncoder and JSONDecoder.

  2. Change python_val.to_json() to JSONEncoder(python_type).encode(python_val).

  3. Change expected_python_type.from_json(json_str) to JSONDecoder(expected_python_type).decode(json_str).

  4. Change return dataclass_json(dataclasses.make_dataclass(schema_name, attribute_list)) to
    return dataclasses.make_dataclass(schema_name, attribute_list),
    since we no longer need the to_json and from_json methods.

  5. Add tests.

  6. Fix mypy errors.

  7. Update type annotations.

  8. Remove flytekit-dolt from the CI tests, since it no longer works and needs a new implementation.

  9. Use JSONEncoder and JSONDecoder to convert a dataclass to a JSON string and back when
    the user did not use dataclasses_json or DataClassJSONMixin.

  10. Add an encoder registry and a decoder registry to cache JSONEncoder and JSONDecoder instances when using List[dataclass].

  11. Add a benchmark test for a real-world scenario: a dynamic workflow that returns List[dataclass].
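
Items 4 and 9 can be illustrated with a minimal, stdlib-only sketch. It makes no claim about flytekit internals: `encode` and `decode` are hypothetical stand-ins for an external codec such as `JSONEncoder(...).encode` and `JSONDecoder(...).decode`, and the schema name and fields are made up. The point is only that a class built with `dataclasses.make_dataclass` has no `to_json` or `from_json` method, yet an external codec can still round-trip it.

```python
import dataclasses
import json
from typing import Any, Type

# Hypothetical schema; in flytekit the name and attributes come from elsewhere.
attribute_list = [("x", int), ("y", str)]
Guessed = dataclasses.make_dataclass("Guessed", attribute_list)


def encode(value: Any) -> str:
    """Stand-in for an external encoder: works on any plain dataclass."""
    return json.dumps(dataclasses.asdict(value))


def decode(cls: Type[Any], json_str: str) -> Any:
    """Stand-in for an external decoder (flat dataclasses only)."""
    return cls(**json.loads(json_str))


original = Guessed(x=1, y="one")
json_str = encode(original)
restored = decode(Guessed, json_str)
```

Because the codec lives outside the class, the generated dataclass does not need to be wrapped in `dataclass_json` first.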

How was this patch tested?

  1. Unit tests.
  2. Locally and remotely with only the dataclass decorator.
  3. Locally and remotely with a dataclass that inherits from DataClassJSONMixin (for backward compatibility).
  4. Locally and remotely with the dataclass_json decorator (for backward compatibility).

Note: you can use the image futureoutlier/dataclass:0321 to test it.

Setup process

Run the example locally, then remotely:

python dataclass_example.py
pyflyte run --remote --image localhost:30000/dataclass:0951 dataclass_example.py dataclass_wf --x 10 --y 20

dataclass_example.py:

import os
import tempfile
from dataclasses import dataclass
from typing import Tuple, List, Optional
import pandas as pd
from flytekit import task, workflow
from flytekit.types.directory import FlyteDirectory
from flytekit.types.file import FlyteFile
from flytekit.types.structured import StructuredDataset
# from mashumaro.mixins.json import DataClassJSONMixin

@dataclass
class Datum:
    x: int
    y: str
    z: dict[int, int]
    w: Optional[List[int]] = None

@task
def stringify(s: int) -> Datum:
    """
    A dataclass return will be treated as a single complex JSON return.
    """
    return Datum(x=s, y=str(s), z={s: s}, w=[s, s, s, s])

@task
def add(x: Datum, y: Datum) -> Datum:
    """
    Flytekit automatically converts the provided JSON into a data class.
    If the structures don't match, it triggers a runtime failure.
    """
    x.z.update(y.z)
    return Datum(x=x.x + y.x, y=x.y + y.y, z=x.z, w=x.w + y.w)

@dataclass
class FlyteTypes:
    dataframe: StructuredDataset
    file: FlyteFile
    directory: FlyteDirectory

@task
def upload_data() -> FlyteTypes:
    """
    Flytekit will upload FlyteFile, FlyteDirectory and StructuredDataset to the blob store,
    such as GCP or S3.
    """
    # 1. StructuredDataset
    df = pd.DataFrame({"Name": ["Tom", "Joseph"], "Age": [20, 22]})

    # 2. FlyteDirectory
    temp_dir = tempfile.mkdtemp(prefix="flyte-")
    df.to_parquet(temp_dir + "/df.parquet")

    # 3. FlyteFile
    # Close the file so the bytes are flushed to disk before it is uploaded.
    with tempfile.NamedTemporaryFile(delete=False) as file_path:
        file_path.write(b"Hello, World!")

    fs = FlyteTypes(
        dataframe=StructuredDataset(dataframe=df),
        file=FlyteFile(file_path.name),
        directory=FlyteDirectory(temp_dir),
    )
    return fs


@task
def download_data(res: FlyteTypes):
    assert pd.DataFrame({"Name": ["Tom", "Joseph"], "Age": [20, 22]}).equals(res.dataframe.open(pd.DataFrame).all())
    # Use a context manager so the downloaded file is closed after reading.
    with open(res.file, "r") as f:
        assert f.read() == "Hello, World!"
    assert os.listdir(res.directory) == ["df.parquet"]

@workflow
def dataclass_wf(x: int, y: int) -> Tuple[Datum, FlyteTypes]:
    o1 = add(x=stringify(s=x), y=stringify(s=y))
    o2 = upload_data()
    download_data(res=o2)
    return o1, o2

if __name__ == "__main__":
    print(dataclass_wf(x=10, y=20))

Dockerfile:

FROM python:3.9-slim-buster
USER root
WORKDIR /root
ENV PYTHONPATH /root
RUN apt-get update && apt-get install build-essential -y
RUN apt-get install git -y

RUN pip install -U git+https://github.com/flyteorg/flytekit.git@30223e45c6b773cb25846f5031f92e4f1f783c33
RUN pip install pandas -U

Screenshots

local execution (with only dataclass decorator)

image

local execution (with DataClassJSONMixin)

image

remote execution (with only dataclass decorator)

image image

remote execution (with DataClassJSONMixin)

image image

remote execution (with dataclass_json)

image

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Mar 18, 2024
@Future-Outlier Future-Outlier marked this pull request as draft March 18, 2024 13:42
@Future-Outlier Future-Outlier marked this pull request as ready for review March 18, 2024 14:22
flytekit/core/type_engine.py
@@ -720,7 +701,7 @@ def _fix_val_int(self, t: typing.Type, val: typing.Any) -> typing.Any:

         return val

-    def _fix_dataclass_int(self, dc_type: Type[DataClassJsonMixin], dc: DataClassJsonMixin) -> DataClassJsonMixin:
+    def _fix_dataclass_int(self, dc_type: Type, dc: dataclasses.dataclass) -> dataclasses.dataclass:  # type: ignore
Member Author

I'm not sure whether there is a better way to write this type annotation.

Member

Can dc_type be Type[dataclasses.dataclass]? (I think this is required for dataclasses.fields to work.)

Given how dynamic the code is, I think we have to go with dc: typing.Any for now.

Member Author

You're right!
I have updated the annotations following your advice, thank you.
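
For readers following the annotation discussion: `dataclasses.dataclass` is a decorator function rather than a type, which is why it does not work as an annotation without `# type: ignore`. The sketch below shows the `typing.Any` approach together with `dataclasses.fields`. The helper `fix_ints` and the `Point` class are hypothetical, and the assumption that `_fix_dataclass_int` restores ints that came back as floats is inferred from its name only, not from the flytekit source.

```python
import dataclasses
from typing import Any, Type


@dataclasses.dataclass
class Point:
    x: int
    y: float


def fix_ints(dc_type: Type[Any], dc: Any) -> Any:
    """Cast float values back to int for fields annotated as int."""
    for field in dataclasses.fields(dc_type):
        value = getattr(dc, field.name)
        if field.type is int and isinstance(value, float):
            setattr(dc, field.name, int(value))
    return dc


# Dataclasses do not validate types at runtime, so x can hold 3.0 here.
fixed = fix_ints(Point, Point(x=3.0, y=1.5))
```

`dataclasses.fields` accepts either a dataclass or an instance of one, so annotating the instance as `typing.Any` costs nothing at runtime.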

codecov bot commented Mar 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.49%. Comparing base (55f0b19) to head (0a33f53).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2279      +/-   ##
==========================================
+ Coverage   83.46%   83.49%   +0.03%     
==========================================
  Files         324      324              
  Lines       24754    24757       +3     
  Branches     3521     3519       -2     
==========================================
+ Hits        20662    20672      +10     
+ Misses       3460     3455       -5     
+ Partials      632      630       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Future-Outlier Future-Outlier changed the title Remove dataclass_json and DataClassJSONMixin for dataclass transformer Remove required dataclass_json and DataClassJSONMixin for dataclass transformer Mar 18, 2024
@Future-Outlier Future-Outlier changed the title Remove required dataclass_json and DataClassJSONMixin for dataclass transformer Stop requiring users to import dataclasses_json or DataClassJSONMixin for dataclass Mar 18, 2024
@Future-Outlier
Member Author

cc @thomasjpfan Please take a look, thank you!

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Mar 19, 2024
@Future-Outlier Future-Outlier marked this pull request as draft March 19, 2024 02:22
@Future-Outlier Future-Outlier marked this pull request as ready for review March 19, 2024 02:46
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Mar 19, 2024
@Future-Outlier Future-Outlier force-pushed the remove-dataclass_json branch from dbdd3eb to bd0802e Compare March 19, 2024 02:51
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Mar 19, 2024
@Future-Outlier Future-Outlier changed the title Stop requiring users to import dataclasses_json or DataClassJSONMixin for dataclass Stop requiring users to import dataclasses_json or DataClassJSONMixin for dataclass Mar 19, 2024
@Future-Outlier Future-Outlier force-pushed the remove-dataclass_json branch from b426eea to 5a02577 Compare March 19, 2024 12:16
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Mar 19, 2024
@Future-Outlier
Member Author

I think we can remove DataClassJSONMixin from every class that also has the @dataclass decorator.
Should we do this in this PR or in a separate one?

image

Signed-off-by: Future-Outlier <[email protected]>
@Future-Outlier
Member Author

@thomasjpfan , @Fatal1ty
I used this example to test the speed of the workflow; the performance seems to be close : )

from typing import List

from flytekit import dynamic, task, workflow

# Datum is the dataclass defined in dataclass_example.py above.

@task
def create_dataclasses() -> List[Datum]:
    return [Datum(x=1, y="1", z={1: 1}, w=[1, 1, 1, 1])]

@task
def concat_dataclasses(x: List[Datum], y: List[Datum]) -> List[Datum]:
    return x + y

@dynamic
def dynamic_wf() -> List[Datum]:
    all_dataclasses = [Datum(x=1, y="1", z={1: 1}, w=[1, 1, 1, 1])]
    # Each iteration serializes and deserializes a growing List[Datum].
    for _ in range(300):
        data = create_dataclasses()
        all_dataclasses = concat_dataclasses(x=all_dataclasses, y=data)
    return all_dataclasses

@workflow
def benchmark_workflow() -> List[Datum]:
    return dynamic_wf()

if __name__ == "__main__":
    import time

    start_time = time.time()
    benchmark_workflow()
    end_time = time.time()
    print(f"Time taken: {end_time - start_time}")
image

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Mar 22, 2024
@Future-Outlier
Member Author

I think the main reason for this performance gap is that we need to create a JSONEncoder or a JSONDecoder to serialize and deserialize our dataclasses.

I’m sure it is. Creating decoders and encoders is not a cheap operation. I would recommend using a registry dataclass_type -> decoder(encoder) if this is an issue for you.

I've tested your suggestion, and it reduces the time drastically.
image
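
The registry idea can be sketched without any third-party dependency. `CodecRegistry` and `FakeDecoder` below are hypothetical names, with `FakeDecoder` standing in for an object that is expensive to construct (such as a decoder built per dataclass type). This is not the code merged in this PR, only an illustration of caching one instance per type.

```python
from typing import Any, Callable, Dict, Type


class CodecRegistry:
    """Cache one codec per dataclass type so each is built only once."""

    def __init__(self, factory: Callable[[Type[Any]], Any]) -> None:
        self._factory = factory
        self._cache: Dict[Type[Any], Any] = {}
        self.builds = 0  # counts how often the expensive constructor ran

    def get(self, dc_type: Type[Any]) -> Any:
        try:
            return self._cache[dc_type]
        except KeyError:
            # First request for this type: build the codec and remember it.
            self.builds += 1
            codec = self._cache[dc_type] = self._factory(dc_type)
            return codec


class FakeDecoder:
    """Hypothetical stand-in for an expensive-to-construct decoder."""

    def __init__(self, dc_type: Type[Any]) -> None:
        self.dc_type = dc_type


class A: ...
class B: ...


registry = CodecRegistry(FakeDecoder)
decoders = [registry.get(t) for t in (A, A, B, A)]
```

With a list of 300 items of the same dataclass type, a cache like this means the constructor runs once instead of once per element.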

Member

@thomasjpfan thomasjpfan left a comment

Overall, this looks good. I am okay with the current scope. We can update the other parts of the codebase in follow up PRs.

flytekit/core/type_engine.py (outdated)
Comment on lines 738 to 739
        if not self._decoder.get(expected_python_type):
            self._decoder[expected_python_type] = JSONDecoder(expected_python_type)
Member

Same here: use a regular try: ... except KeyError: instead.

Comment on lines +2505 to +2507
    class DatumDataclass:
        x: int
        y: Color
Member

Can you add another test to see what happens with typing.Union? Specifically:

@dataclass
class DatumDataUnion:
    path: typing.Union[str, os.PathLike]

If the path started out as a pathlib.Path(...), does it deserialize into a string or a Path object?

Member Author

I found that a Path object is not serializable, but I will test it with other cases!

@dataclass
class DatumDataUnion(DataClassJSONMixin):
    path: typing.Union[str, os.PathLike]

lt = TypeEngine.to_literal_type(DatumDataUnion)
datum_dataunion = DatumDataUnion(Path("/tmp"))
lv = transformer.to_literal(ctx, datum_dataunion, DatumDataUnion, lt)
gt = transformer.guess_python_type(lt)
pv = transformer.to_python_value(ctx, lv, expected_python_type=gt)
assert datum_dataunion.path == pv.path

screenshots:
image
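
The serialization problem can be reproduced with the standard `json` module alone. This sketch only shows stdlib behavior; mashumaro and flytekit may handle `os.PathLike` values differently, and converting with `os.fspath` is an illustrative workaround rather than what this PR does.

```python
import json
import os
from pathlib import Path

value = Path("/tmp")

# The stdlib encoder has no rule for Path objects.
try:
    json.dumps({"path": value})
    path_is_serializable = True
except TypeError:
    path_is_serializable = False

# Converting with os.fspath first makes the value serializable...
json_str = json.dumps({"path": os.fspath(value)})
restored = json.loads(json_str)["path"]
# ...but the round trip yields a str, so the Path type is lost.
```

This is the question behind the typing.Union[str, os.PathLike] test request: after a JSON round trip, the decoder has to decide which member of the union to rebuild.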

Member Author

I found that the TypeTransformer can't handle typing.Union[str, os.PathLike] or typing.Union[str, FlyteFile].
(I think the decision not to support os.PathLike is documented here:
https://docs.flyte.org/en/latest/api/flytekit/generated/flytekit.types.file.FlyteFile.html#flytekit.types.file.FlyteFile.path)

Supporting typing.Union[str, FlyteFile] in Flyte could be an enhancement.
I need to trace more code to find the core reason why we can't support these cases now, but it is definitely not caused by this PR.

Member

Hmm, thank you for checking. When I suggested typing.Union[str, os.PathLike], I was thinking about FlyteFile. But FlyteFile has its own type transformer, so it's okay.

In any case, I'm okay with the current scope of this PR.

@wild-endeavor
Contributor

Is this backward compatible? For example: serialize with an old flytekit release and old user code, then deserialize with the new flytekit and new user code.

         self._serialize_flyte_type(python_val, python_type)

-        json_str = python_val.to_json()  # type: ignore
+        if hasattr(python_val, "to_json"):
+            json_str = python_val.to_json()
Member

@thomasjpfan thomasjpfan Apr 1, 2024

@wild-endeavor With this check, we are backward compatible. If one used DataClassJsonMixin or @dataclass_json, then to_json is defined and called, which matches master's behavior. XREF: https://github.com/lidatong/dataclasses-json/blob/8512afc0a87053dbde52af0519c74198fa3bb873/dataclasses_json/api.py#L26

@Future-Outlier Can you include a comment here about how this preserves backward compatibility?

Member Author

Hi, @wild-endeavor and @thomasjpfan

In flytekit, we use to_json to serialize dataclasses and from_json to deserialize them. In older flytekit releases, we required users to add the dataclass_json decorator or inherit from the DataClassJsonMixin class,
because both provide the methods needed to serialize (convert a dataclass to a JSON string) and deserialize (convert a JSON string back to a dataclass).

This PR uses two classes from the mashumaro module, JSONEncoder and JSONDecoder, to serialize and deserialize dataclasses.
These classes remove the reliance on the to_json and from_json methods.

We first check for the method's presence with if hasattr(python_val, "to_json"):, and only fall back to JSONEncoder when it is missing.
Therefore, these changes are not breaking:
dataclasses that inherit from DataClassJsonMixin continue to use to_json and from_json for serialization and deserialization with this version of flytekit.

to_json REF: https://github.com/flyteorg/flytekit/blob/master/flytekit/core/type_engine.py#L498
from_json REF: https://github.com/flyteorg/flytekit/blob/master/flytekit/core/type_engine.py#L750
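
The backward-compatibility check described above can be sketched as follows. The names `to_json_str`, `fallback_encode`, `Plain`, and `Legacy` are hypothetical, and `fallback_encode` is a stdlib stand-in for `JSONEncoder(python_type).encode(python_val)`; the merged flytekit code may differ in detail.

```python
import dataclasses
import json
from typing import Any


def fallback_encode(value: Any) -> str:
    """Hypothetical stand-in for an external encoder."""
    return json.dumps(dataclasses.asdict(value))


def to_json_str(python_val: Any) -> str:
    # Classes that used dataclass_json or a JSON mixin define to_json;
    # keep calling it so their output matches older releases.
    if hasattr(python_val, "to_json"):
        return python_val.to_json()
    return fallback_encode(python_val)


@dataclasses.dataclass
class Plain:
    x: int


@dataclasses.dataclass
class Legacy:
    x: int

    def to_json(self) -> str:  # mimics a mixin-provided method
        return json.dumps({"x": self.x, "legacy": True})


plain_json = to_json_str(Plain(1))
legacy_json = to_json_str(Legacy(1))
```

Old user code keeps its existing serialization path, while a dataclass with no mixin goes through the new encoder.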

Member

To clarify, I wanted a short comment in the code that states how the "to_json" check helps preserve backward compatibility.

Signed-off-by: Future-Outlier <[email protected]>
Co-authored-by: Thomas J. Fan <[email protected]>
thomasjpfan previously approved these changes Apr 2, 2024
Member

@thomasjpfan thomasjpfan left a comment

I think not requiring dataclasses_json or DataClassJSONMixin for many use cases is already a net improvement.

LGTM

@dosubot dosubot bot added the lgtm This PR has been approved by maintainer label Apr 2, 2024
@Future-Outlier
Member Author

I think not requiring dataclasses_json or DataClassJSONMixin for many use cases is already a net improvement.

LGTM

Added comments, thank you so much

Member

@thomasjpfan thomasjpfan left a comment

LGTM

@pingsutw pingsutw merged commit f888a5c into master Apr 3, 2024
48 checks passed
ChungYujoyce pushed a commit to ChungYujoyce/flytekit that referenced this pull request Apr 5, 2024
…xin` for dataclass (flyteorg#2279)

Signed-off-by: Future-Outlier <[email protected]>
Co-authored-by: Thomas J. Fan <[email protected]>
Co-authored-by: Eduardo Apolinario <[email protected]>
fiedlerNr9 pushed a commit that referenced this pull request Jul 25, 2024
…xin` for dataclass (#2279)

Signed-off-by: Future-Outlier <[email protected]>
Co-authored-by: Thomas J. Fan <[email protected]>
Co-authored-by: Eduardo Apolinario <[email protected]>
Signed-off-by: Jan Fiedler <[email protected]>