[Core Feature] Logical types: static type checking for higher level user defined types. #1363

Open
kumare3 opened this issue Aug 19, 2021 · 5 comments
Assignees
Labels
enhancement New feature or request flytekit FlyteKit Python related issue stale

Comments

@kumare3
Contributor

kumare3 commented Aug 19, 2021

Motivation: Why do you think this is important?
Flytekit, and in the future other SDKs, supports progressive typing and allows users to define their own types. The TypeTransformers in flytekit today effectively result in type erasure at runtime: higher-level types are converted to the underlying Flyte types, and on retrieval the information about the source type is lost. This works in theory, as the receiving SDK has the right types defined. It also makes it easy to cast a type into any of its derivative types. The same technique has been used successfully in Java and on the JVM.

Examples of type derivatives
Convert from Spark Data frame -> Flyte.Schema -> Pandas data frame.

But it is desirable to keep the source type available, so that we can recover it even without explicitly requesting that type.

Example:
remote.get().outputs.x -> can be correctly cast if available

Moreover, one problem with type erasure is loss of static type checking across languages or different tasks.

To overcome this problem, this issue proposes a new type called LogicalType, which keeps information about both the source type and the associated transport type.
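
To make the difference concrete, here is a minimal Python sketch. It is not flytekit code: `ErasedType`, `LogicalType` and `recover_source` are hypothetical names used only for illustration, and the transport type is modelled as a plain string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ErasedType:
    """What is stored today: only the transport (Flyte) type survives."""
    transport: str  # e.g. "SCHEMA"


@dataclass(frozen=True)
class LogicalType:
    """Hypothetical stand-in for the proposal: source type plus transport type."""
    urn: str             # source type, e.g. "pyspark.DataFrame"
    representation: str  # transport type, e.g. "SCHEMA"


def recover_source(stored) -> Optional[str]:
    """Return the source type name if it was kept, otherwise None."""
    return getattr(stored, "urn", None)


erased = ErasedType(transport="SCHEMA")
logical = LogicalType(urn="pyspark.DataFrame", representation="SCHEMA")
```

With the erased form, a reader has to be told which Python type to request; with the logical form, the stored definition alone is enough to pick one.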

Goal: What should the final outcome look like, ideally?
Users can specify new types, and we can reverse engineer those types from the stored definition. This helps with debugging, static type assertions, optimizations, and extensibility.

Describe alternatives you've considered
What exists today - type erasure!

[Optional] Propose: Link/Inline OR Additional context
-- from @kanterov
A logical type is a type alias for an existing LiteralType, and values of logical types are represented with the existing Literal. Logical types can correspond to built-in or user-defined types in an SDK. A logical type is defined as follows (this approach is inspired by the Apache Beam proto):

message LogicalType {
  // Required. Unique resource name for LogicalType.
  // There is a list of well-known logical types supported by SDKs, 
  // and users can add their own
  string urn = 1; 
  
  // Required. Existing LiteralType used to represent values of LogicalType
  LiteralType representation = 2;

  // Optional. Additional argument for logical type. May be used to serialize additional information
  Literal argument = 3;

  // Optional. Type of argument.
  LiteralType argument_type = 4;
}

Example of urn

pandas.DataFrame, pyspark.DataFrame

Semantics
Type t1 is a supertype of logical type t2 iff any of the following holds:
  • t1 is strictly equal to t2
  • t1 is a supertype of t3, and t3 is a supertype of t2
  • t1 is a supertype of t2.representation

This allows us to read unknown logical types using their representation. E.g. if task_1 produces output: LogicalType(representation=INTEGER) and task_2 has input of INTEGER, it’s possible to bind task_2.input to task_1.output. However, it isn’t possible to do the opposite: use any INTEGER as LogicalType(representation=INTEGER).
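
The rules above can be sketched as a small recursive check. This is only an illustration under assumptions: `Simple` and `Logical` are hypothetical stand-ins for LiteralType and the proposed LogicalType, not flyteidl classes. Transitivity does not need its own branch here, because unwrapping `representation` repeatedly already walks the whole chain.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Simple:
    """Hypothetical stand-in for a plain LiteralType such as INTEGER."""
    name: str


@dataclass(frozen=True)
class Logical:
    """Hypothetical stand-in for the proposed LogicalType."""
    urn: str
    representation: "TypeLike"


TypeLike = Union[Simple, Logical]


def is_supertype(t1: TypeLike, t2: TypeLike) -> bool:
    """True if a value of type t2 can be bound where t1 is expected."""
    if t1 == t2:
        # Rule 1: strict equality.
        return True
    if isinstance(t2, Logical):
        # Rule 3, applied recursively (which also covers rule 2).
        return is_supertype(t1, t2.representation)
    return False


INTEGER = Simple("INTEGER")
INT32 = Logical(urn="INT32", representation=INTEGER)
```

Here `is_supertype(INTEGER, INT32)` holds, so an INT32 output can feed an INTEGER input, while `is_supertype(INT32, INTEGER)` does not.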

SDKs have a list of well-known logical types that are mapped to built-in or custom types. flyteconsole or flytectl can have a special behaviour for well-known logical types.

flytepropeller shouldn’t introduce special behaviour for well-known logical types when doing type-checking. This limitation allows new logical types to be introduced without every Flyte component being aware of them. When a logical type is unknown, it should be safe for an implementation to fall back to its representation.
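
On the SDK side, the fallback could look roughly like the sketch below. The `DECODERS` registry, the `LOCAL_DATE` urn spelling and the `decode` function are assumptions made for illustration; none of them are existing flytekit APIs.

```python
import datetime
from typing import Any, Callable, Dict

# Hypothetical registry: urn -> function that turns the representation value
# into the richer SDK value. Only well-known urns have an entry.
DECODERS: Dict[str, Callable[[Any], Any]] = {
    # LOCAL DATE is represented as DATETIME; keep only the date part.
    "LOCAL_DATE": lambda raw: raw.date(),
}


def decode(urn: str, raw: Any) -> Any:
    """Decode a value of a logical type, falling back to its representation."""
    decoder = DECODERS.get(urn)
    if decoder is None:
        # Unknown logical type: the representation value is still usable as-is.
        return raw
    return decoder(raw)
```

A component that has never heard of a urn keeps working, it just sees the plain representation value.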

Examples of well-known logical types

  • INT32 (represented as INTEGER)
  • FIXEDBYTES(N) (represented as BINARY): argument type is INTEGER, representing length of fixed byte array
  • LOCAL DATE (represented as DATETIME): date without timezone
  • LOCAL DATETIME (represented as DATETIME): datetime without timezone
  • DECIMAL(P, D) (represented as BYTES): argument_type is {p: INTEGER, d: INTEGER}, where p is the precision and d is the number of digits after the decimal point
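
As a sketch of how an SDK might use the argument of a parameterised logical type, take FIXEDBYTES(N). The `FixedBytes` class below is hypothetical and only models the SDK-side check; how the argument is serialized into the Literal is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixedBytes:
    """Hypothetical SDK-side view of FIXEDBYTES(N)."""
    n: int  # the logical type's argument (argument_type is INTEGER)

    def validate(self, value: bytes) -> bytes:
        """Accept the binary value only if it has exactly n bytes."""
        if len(value) != self.n:
            raise ValueError(f"expected {self.n} bytes, got {len(value)}")
        return value


# Example: a 32-byte digest type.
digest_type = FixedBytes(n=32)
```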

Example: introducing INT32
flyteidl has an INTEGER type, which is a 64-bit integer. It’s natural for SDK users to use 32-bit integers unless they need 64 bits. In Java, there are two separate types, Integer and Long, representing 32-bit and 64-bit integers. However, this creates a problem, because a 64-bit value can overflow when it is read into a 32-bit integer. Introducing a logical type for INT32 allows tasks to read INT32 only if the input is bound to a literal that is known to be INT32.
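
The overflow concern can be shown with a short range check. Python integers are unbounded, so the 32-bit limits are spelled out by hand; `to_int32` is a hypothetical helper, not part of any SDK.

```python
INT32_MIN = -(2 ** 31)
INT32_MAX = 2 ** 31 - 1


def to_int32(value: int) -> int:
    """Narrow a 64-bit INTEGER value to 32 bits, refusing values that do not fit."""
    if not INT32_MIN <= value <= INT32_MAX:
        raise OverflowError(f"{value} does not fit in a 32-bit integer")
    return value
```

Without a logical type this check can only happen at runtime, on every read. With an INT32 logical type, the binding can be rejected during type-checking instead.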

@kumare3 kumare3 added the enhancement and untriaged labels and removed the untriaged label Aug 19, 2021
@wild-endeavor wild-endeavor added the flytekit FlyteKit Python related issue label Aug 19, 2021
@kumare3
Contributor Author

kumare3 commented Sep 9, 2021

One more requirement: LogicalTypes should be able to support meta-outputs that are associated with the core type. For example, when you run a Great Expectations assertion, the result is a Markdown or HTML file containing the results. Flyte does support multiple outputs for a task, but some outputs can automatically be associated with metadata, such as the test suite results in this case. IMO, LogicalTypes should carry this information with them, and FlyteConsole or other clients can show it separately if needed.

Another example could be a DataSet that has an index with it. This is like a multipart blob, but with an additional file (e.g. a JSON or a CSV) that lists all elements in the multipart. This can help provide transactional semantics on multipart directories.

@EngHabu
Contributor

EngHabu commented Nov 3, 2021

Yee, Ketan, Eduardo and I met to discuss Logical Types further and get on the same page... These are the type changes we are proposing:

The goal of Logical Types is to enable different SDKs to reason about higher-level types in the same way. For example, if users define a BigInt type in flyteKit, the LiteralType should have enough information for FlyteKit Remote to map the literal type back to a Python BigInt class... This also allows flyteConsole to have special visualizations for BigInt.

Additional changes:

  1. Add a metadata field to the Literal message. This allows us to attach struct/json metadata to the Literals (e.g. Great Expectations validation on an input)
  2. Add a field to the LiteralType to indicate whether it's an optional type. (e.g. this allows Optional[int] in Python)
message LogicalTypeInfo {
  // Required. Unique resource name for LogicalType.
  // There is a list of well-known logical types supported by SDKs, 
  // and users can add their own
  string urn = 1; 

  string friendly_name = 2;

  string origin_type = 3;

  map<string, string> labels = 4;

  // do we need to add argument? argument_type?
}

// Defines a strong type to allow type checking between interfaces.
message LiteralType {
    oneof type {
        // A simple type that can be compared one-to-one with another.
        SimpleType simple = 1;

        // A complex type that requires matching of inner fields.
        SchemaType schema = 2;

        // Defines the type of the value of a collection. Only homogeneous collections are allowed.
        LiteralType collection_type = 3;

        // Defines the type of the value of a map type. The type of the key is always a string.
        LiteralType map_value_type = 4;

        // A blob might have specialized implementation details depending on associated metadata.
        BlobType blob = 5;

        // Defines an enum with pre-defined string values.
        EnumType enum_type = 7;
    }

    // This field contains type metadata that is descriptive of the type, but is NOT considered in type-checking.  This might be used by
    // consumers to identify special behavior or display extended information for the type.
    google.protobuf.Struct metadata = 6;

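    // NOTE: field number 7 is already used by enum_type above, so this field needs a distinct number; 8 is a placeholder.
    LogicalTypeInfo logical_type = 8;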
}
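
As a rough sketch of how a client could map a LogicalTypeInfo back to a Python class: the code below assumes that `origin_type` holds a dotted Python path such as `decimal.Decimal`. The proposal does not define the field's format, so both that convention and the `resolve_origin_type` helper are hypothetical.

```python
import importlib
from typing import Optional


def resolve_origin_type(origin_type: str) -> Optional[type]:
    """Try to import the class named by origin_type; return None if unavailable.

    Assumes origin_type is a dotted path of the form "module.ClassName".
    """
    module_name, _, attr = origin_type.rpartition(".")
    if not module_name:
        return None
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The defining package is not installed on this client.
        return None
    return getattr(module, attr, None)
```

When this returns None, the client can still read the value through the plain LiteralType, which matches the fallback behaviour described earlier in the thread.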

@kumare3 kumare3 added this to the 1.0.0 - Phoenix! milestone Nov 13, 2021
@EngHabu EngHabu modified the milestones: 1.0.0 - Phoenix!, 1.0.1 Mar 9, 2022
@EngHabu EngHabu removed this from the 1.0.1 milestone May 4, 2022
@github-actions

Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏

@github-actions github-actions bot added the stale label Aug 26, 2023
@kumare3
Contributor Author

kumare3 commented Aug 26, 2023

Maybe we should keep this issue in a deprioritized state

@github-actions github-actions bot removed the stale label Aug 27, 2023

Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable.
Thank you for your contribution and understanding! 🙏

@github-actions github-actions bot added the stale label May 24, 2024

4 participants