[Core Feature] Logical types: static type checking for higher level user defined types. #1363
Comments
One more requirement: LogicalTypes should be able to support meta-outputs that are associated with the core type. For example, when you run a Great Expectations assertion, the result is a Markdown or HTML file containing the results. Flyte does support multiple outputs for a task, but some outputs can automatically be associated with metadata, such as the test suite results in this case. IMO, LogicalTypes should carry this information with them, and FlyteConsole or other clients can show it separately if needed. Another example could be a DataSet that has an index with it. This is like a multipart Blob, but with an additional file (e.g. a JSON or CSV) that lists all elements in the multipart. This can help provide transactional semantics on multipart directories. |
Yee, Ketan, Eduardo and I met to discuss Logical Types further and get on the same page... These are the type changes we are proposing: The goal of Logical Types is to enable different SDKs to reason about higher-level types in the same way. For example, if users define a BigInt type in flytekit, the LiteralType should carry enough information for flytekit Remote to map the literal type back to Python's BigInt class... This also allows flyteconsole to have special visualizations for BigInt. Additional changes:
message LogicalTypeInfo {
// Required. Unique resource name for LogicalType.
// There is a list of well-known logical types supported by SDKs,
// and users can add their own
string urn = 1;
string friendly_name = 2;
string origin_type = 3;
map<string, string> labels = 4;
// do we need to add argument? argument_type?
}
// Defines a strong type to allow type checking between interfaces.
message LiteralType {
oneof type {
// A simple type that can be compared one-to-one with another.
SimpleType simple = 1;
// A complex type that requires matching of inner fields.
SchemaType schema = 2;
// Defines the type of the value of a collection. Only homogeneous collections are allowed.
LiteralType collection_type = 3;
// Defines the type of the value of a map type. The type of the key is always a string.
LiteralType map_value_type = 4;
// A blob might have specialized implementation details depending on associated metadata.
BlobType blob = 5;
// Defines an enum with pre-defined string values.
EnumType enum_type = 7;
}
// This field contains type metadata that is descriptive of the type, but is NOT considered in type-checking. This might be used by
// consumers to identify special behavior or display extended information for the type.
google.protobuf.Struct metadata = 6;
// Logical type information, if this type is an alias for a higher-level type.
LogicalTypeInfo logical_type = 8;
} |
Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏 |
Maybe we should keep this issue in a deprioritized state |
Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable. |
Motivation: Why do you think this is important?
Flytekit, and in the future other SDKs, support progressive typing and allow users to define their own types. The TypeTransformers in flytekit today effectively result in type erasure at runtime. The higher-level types are converted to the underlying Flyte types, and on retrieval the information about the source type is lost. This works in theory, as long as the receiving SDK has the right types defined. It also makes it easy to cast a type into any of its derivative types. This technique has been successfully deployed in languages like Java on the JVM.
Examples of type derivatives
Convert from Spark DataFrame -> Flyte.Schema -> pandas DataFrame.
However, it is desirable to keep the source type available so that we can recover it, even without explicitly requesting that type.
Example:
remote.get().outputs.x -> can be cast correctly if the source type is available
Moreover, one problem with type erasure is loss of static type checking across languages or different tasks.
To overcome this problem, this issue proposes introducing a new type called LogicalType, which keeps information about both the source type and the associated transport type.
Goal: What should the final outcome look like, ideally?
Users can specify new types, and we can reverse engineer those types from the stored definition. This helps with debugging, static type assertions, optimizations, and extensibility.
Describe alternatives you've considered
What exists today - type erasure!
[Optional] Propose: Link/Inline OR Additional context
-- from @kanterov
Logical type is a type alias for an existing LiteralType, and values for logical types are represented with existing Literal. Logical types can correspond to built-in or user-defined types in SDK. A logical type is defined as (this approach is inspired by Apache Beam proto):
Example of urn
Semantics
Type t1 is a supertype of logical type t2 iff any of the following holds:
t1 is strictly equal to t2
t1 is supertype of t3, and t3 is supertype of t2
t1 is supertype of t2.representation
This allows us to read unknown logical types using their representation. E.g. if task_1 produces output: LogicalType(representation=INTEGER) and task_2 has input of INTEGER, it’s possible to bind task_2.input to task_1.output. However, it isn’t possible to do the opposite: use any INTEGER as LogicalType(representation=INTEGER).
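The supertype rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Simple`, `Logical`, `is_supertype`, and the URN `example:int32` are hypothetical names made up for this sketch, not part of flyteidl or flytekit.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Simple:
    """Hypothetical stand-in for a built-in LiteralType such as INTEGER."""
    name: str


@dataclass(frozen=True)
class Logical:
    """Hypothetical stand-in for a logical type: a URN plus its representation."""
    urn: str
    representation: "TypeLike"


TypeLike = Union[Simple, Logical]


def is_supertype(t1: TypeLike, t2: TypeLike) -> bool:
    """Return True if a value of type t2 may be bound where t1 is expected."""
    # Rule 1: strict equality.
    if t1 == t2:
        return True
    # Rule 3: t1 is a supertype of t2.representation.
    # Recursing down the representation chain also covers rule 2 (transitivity).
    if isinstance(t2, Logical):
        return is_supertype(t1, t2.representation)
    return False


INTEGER = Simple("INTEGER")
INT32 = Logical(urn="example:int32", representation=INTEGER)  # made-up URN

# task_2 expects INTEGER, task_1 produces the logical type: binding is allowed.
print(is_supertype(INTEGER, INT32))  # True
# The opposite direction is rejected: a plain INTEGER is not known to be INT32.
print(is_supertype(INT32, INTEGER))  # False
```

Because the check only ever walks from a logical type towards its representation, a component that does not know a given URN can still type-check bindings without any special-casing.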
SDKs have a list of well-known logical types that are mapped to built-in or custom types. flyteconsole or flytectl can have a special behaviour for well-known logical types.
flytepropeller shouldn’t introduce special behaviour for well-known logical types when doing type-checking. This limitation of logical types allows new logical types to be introduced without all components of Flyte being aware of them. When there is an unknown logical type, it should be safe for an implementation to fall back to its representation.
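As a rough sketch of how an SDK might treat well-known versus unknown logical types, the snippet below keeps a registry keyed by URN and falls back to the raw representation when the URN is not registered. The registry, the function names, and the URNs are all assumptions for illustration; no existing flytekit API is implied.

```python
from decimal import Decimal
from typing import Any, Callable, Dict

# Hypothetical SDK-side registry of well-known logical types: URN -> decoder.
WELL_KNOWN: Dict[str, Callable[[Any], Any]] = {}


def register(urn: str, decoder: Callable[[Any], Any]) -> None:
    """Associate a URN with a function that builds the higher-level value."""
    WELL_KNOWN[urn] = decoder


def decode(urn: str, raw_value: Any) -> Any:
    """Decode a value using its logical type, or fall back to the representation."""
    decoder = WELL_KNOWN.get(urn)
    if decoder is None:
        # Unknown logical type: return the value in its underlying representation.
        return raw_value
    return decoder(raw_value)


# Made-up URN for a decimal type whose representation is a string.
register("example:decimal", Decimal)

known = decode("example:decimal", "1.50")     # a Decimal instance
unknown = decode("example:not-registered", "1.50")  # stays a plain str
print(type(known).__name__, type(unknown).__name__)
```

The fallback branch is what lets a new logical type be rolled out in one SDK before other SDKs or tools have learned about it.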
Examples of well-known logical types
Example: introducing INT32
flyteidl has an INTEGER type that is a 64-bit integer. It’s natural for SDK users to use 32-bit integers unless they need 64 bits. In Java, there are two separate types, Integer and Long, representing 32-bit and 64-bit integers. However, this creates a problem because a 64-bit value can overflow when it is narrowed to a 32-bit integer. Introducing a logical type for INT32 allows tasks to read INT32 only if the input is bound to a literal that is known to be INT32.
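To make the overflow concern concrete, here is a small Python sketch of what happens when a 64-bit value is narrowed to 32 bits, next to a checked alternative. The helper names are hypothetical; `wrap_to_int32` emulates two's-complement narrowing (the behaviour of a Java `(int)` cast on a `long`) and is not taken from any Flyte SDK.

```python
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def wrap_to_int32(value: int) -> int:
    """Emulate two's-complement narrowing of an integer to 32 bits."""
    return (value - INT32_MIN) % 2**32 + INT32_MIN


def checked_int32(value: int) -> int:
    """Return value unchanged if it fits in 32 bits, otherwise raise."""
    if not INT32_MIN <= value <= INT32_MAX:
        raise OverflowError(f"{value} does not fit in a 32-bit integer")
    return value


# A value that is valid as a 64-bit INTEGER but one past the 32-bit maximum.
too_big = INT32_MAX + 1

print(wrap_to_int32(too_big))  # -2147483648: silent wrap-around
print(checked_int32(42))       # 42
```

With an INT32 logical type, this kind of range check would not need to happen at read time, since a task input could only be bound to a literal already known to be INT32.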