[Feature] Allow Nested Complex Data Types for Types.Schema #523

Closed
3 of 13 tasks
DobsX opened this issue Sep 25, 2020 · 4 comments
Labels
enhancement New feature or request flyteidl flytekit FlyteKit Python related issue flytepropeller

Comments

@DobsX

DobsX commented Sep 25, 2020

Motivation: Why do you think this is important?
Flyte currently has no support for complex or nested data structures in Types.Schema. There is no native way to select columns of these types into a supported type, so you have to store them as one of the existing types and handle the conversion on the client side.

One workaround is to read and write with a generic Types.Schema() schema, but that offers no type checking and leaves all type handling to the client.

Additionally, if you are using Flyte for an ETL workflow, you don’t have a way to write the results back to the DB in the same format.

Here is a related issue: #22

Goal: What should the final outcome look like, ideally?
Basically, support maps and arrays, and allow them to be nested inside other maps and arrays. Furthermore, if you have a Types.Schema data object, you should be able to store it on S3 and use the various functions/plugins to load your data into a DB as its native type.

Describe alternatives you've considered
The current alternative is to serialize to a JSON string and then deserialize when you consume the data.
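
A minimal sketch of that JSON workaround, using only the Python standard library. The helper names `encode_nested` and `decode_nested` are hypothetical and are not part of flytekit; the point is only to show that the nested value travels as a plain STRING column and that the consumer is responsible for parsing it.

```python
import json


def encode_nested(value) -> str:
    """Serialize a nested map/array value so it fits in a STRING column."""
    return json.dumps(value, sort_keys=True)


def decode_nested(text: str):
    """Parse the STRING column back into nested Python objects."""
    return json.loads(text)


# The nested column is flattened to a string before it is written.
row = {
    "ds": "2020-04-25",
    "row_num": 1,
    "service": "gcp",
    "my_row_example": encode_nested(
        {"my_map": {"banana": "good"}, "my_array": [1, 2, 3, 4]}
    ),
}

# The consumer has to know the column holds JSON and decode it itself.
restored = decode_nested(row["my_row_example"])
```

Nothing in the schema records that `my_row_example` holds a map and an array, so any shape mismatch only shows up when the consumer decodes the string.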

Flyte component

  • Overall
  • Flyte Setup and Installation scripts
  • Flyte Documentation
  • Flyte communication (slack/email etc)
  • FlytePropeller
  • FlyteIDL (Flyte specification language)
  • Flytekit (Python SDK)
  • FlyteAdmin (Control Plane service)
  • FlytePlugins
  • DataCatalog
  • FlyteStdlib (common libraries)
  • FlyteConsole (UI)
  • Other

[Optional] Propose: Link/Inline
N/A

Additional context

Hive Types:

  • ARRAY
  • MAP
  • STRUCT

Presto Types:

  • ARRAY
  • MAP
  • ROW

BigQuery Types:

  • REPEATED (aka array)
  • RECORD (aka STRUCT)

The current Flyte schema column types are:

    class SchemaColumnType(object):
        INTEGER = _types_pb2.SchemaType.SchemaColumn.INTEGER
        FLOAT = _types_pb2.SchemaType.SchemaColumn.FLOAT
        STRING = _types_pb2.SchemaType.SchemaColumn.STRING
        DATETIME = _types_pb2.SchemaType.SchemaColumn.DATETIME
        DURATION = _types_pb2.SchemaType.SchemaColumn.DURATION
        BOOLEAN = _types_pb2.SchemaType.SchemaColumn.BOOLEAN

Here is an example in Presto:

Presto SQL:

WITH z AS (
    SELECT
        '2020-04-25' AS ds,
        1 AS row_num,
        'gcp' AS service,
        MAP(x.fruit, x.goodness) AS fruit_mapping,
        ARRAY[1, 2, 3, 4] AS array_nums
    FROM
        (
            SELECT
                ARRAY['banana', 'orange', 'watermelon'] AS fruit,
                ARRAY['good', 'okay', 'amazing'] AS goodness
        ) x
        
    UNION ALL
    
    SELECT
        '2020-04-25' AS ds,
        2 AS row_num,
        'aws' AS service,
        MAP(x.fruit, x.goodness),
        ARRAY[7, 8, 9] AS array_nums
    FROM
        (
            SELECT
                ARRAY['peach', 'tomato', 'potato'] AS fruit,
                ARRAY['good', 'fake fruit', 'not a fruit'] AS goodness
        ) x
)

SELECT
    z.ds,
    z.row_num,
    z.service,
    CAST(
        ROW(z.fruit_mapping, z.array_nums)
        AS ROW(my_map MAP(VARCHAR, VARCHAR), my_array ARRAY(BIGINT))
    ) AS my_row_example
FROM
    z

Here is what the schema would look like based on the SQL:

CREATE TABLE hive.test.test_dobs_row (
    ds varchar(10),
    row_num integer,
    service varchar(3),
    my_row_example ROW(
        my_map map(varchar, varchar),
        my_array array(bigint)
    )
) WITH ( format = 'PARQUET' );

And here's what it would look like in JSON:

[

    {
        "ds": "2020-04-25",
        "row_num": 1,
        "service": "gcp",
        "my_row_example": {
            "my_map": {
                "banana": "good",
                "orange": "okay",
                "watermelon": "amazing"
            },
            "my_array": [1, 2, 3, 4]
        }
    },

    {
        "ds": "2020-04-25",
        "row_num": 2,
        "service": "aws",
        "my_row_example": {
            "my_map": {
                "potato": "not a fruit",
                "peach": "good",
                "tomato": "fake fruit"
            },
            "my_array": [7, 8, 9]
        }
    }

]

Is this a blocker for you to adopt Flyte
Currently no, but nested structures with arrays and maps are becoming more popular in terms of usage.

@DobsX DobsX added enhancement New feature or request untriaged This issues has not yet been looked at by the Maintainers labels Sep 25, 2020
@EngHabu
Contributor

EngHabu commented Oct 21, 2020

I totally agree this is a limitation (and possibly an oversight) in the FlyteIdl design.

I think it's worth reevaluating the decision to make the supported column types in schemas a strict subset of the supported FlyteIdl Literal Types. It might be worth departing from the current structure of schemas and introducing a lighter-weight type (e.g. Rows / Records / Columns) that carries metadata about its column names and types, where the types can be any LiteralType supported by FlyteIdl. We've briefly discussed that in the past but the use-cases weren't there to support the investment.
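
To make the idea concrete, here is a rough sketch of a record type whose columns can hold arbitrarily nested types. Everything in it (`Primitive`, `ArrayOf`, `MapOf`, `Record`, `conforms`) is a hypothetical name invented for illustration, not FlyteIdl or flytekit API, and it assumes map keys are always strings. It checks a row shaped like the JSON example from the issue description.

```python
from dataclasses import dataclass
from typing import Any, Dict, Union


@dataclass
class Primitive:
    python_type: type


@dataclass
class ArrayOf:
    element: "ColumnType"


@dataclass
class MapOf:
    value: "ColumnType"  # assumption: keys are always strings


@dataclass
class Record:
    fields: Dict[str, "ColumnType"]


ColumnType = Union[Primitive, ArrayOf, MapOf, Record]


def conforms(value: Any, col: ColumnType) -> bool:
    """Recursively check that a Python value matches a column type."""
    if isinstance(col, Primitive):
        # bool is a subclass of int, so reject it for non-bool columns.
        if isinstance(value, bool) and col.python_type is not bool:
            return False
        return isinstance(value, col.python_type)
    if isinstance(col, ArrayOf):
        return isinstance(value, list) and all(
            conforms(v, col.element) for v in value
        )
    if isinstance(col, MapOf):
        return isinstance(value, dict) and all(
            isinstance(k, str) and conforms(v, col.value)
            for k, v in value.items()
        )
    if isinstance(col, Record):
        return (
            isinstance(value, dict)
            and value.keys() == col.fields.keys()
            and all(conforms(value[n], t) for n, t in col.fields.items())
        )
    raise TypeError(f"unknown column type: {col!r}")


schema = Record({
    "ds": Primitive(str),
    "row_num": Primitive(int),
    "service": Primitive(str),
    "my_row_example": Record({
        "my_map": MapOf(Primitive(str)),
        "my_array": ArrayOf(Primitive(int)),
    }),
})

row = {
    "ds": "2020-04-25",
    "row_num": 1,
    "service": "gcp",
    "my_row_example": {
        "my_map": {"banana": "good", "orange": "okay"},
        "my_array": [1, 2, 3, 4],
    },
}

is_valid = conforms(row, schema)
```

Because `Record`, `ArrayOf` and `MapOf` all accept any `ColumnType`, nesting falls out of the definition instead of needing a separate enum entry for each combination.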

I know we have discussed this internally before @DobsX and I thank you for articulating the problem and the request. Is it something you might be able to spare time to work on? I'll be happy to help guide the implementation and provide as much context as I can... As Presto, BigQuery and Hive gain more and more traction on Flyte, I know a lot of people will appreciate the flexibility (and not compromise type safety). Please let me know...

@DobsX
Author

DobsX commented Oct 22, 2020

@EngHabu I definitely want to work on this, along with some other Flyte related features. I should have some time in the next couple of months. But FYI, likely won't be able to start immediately.

Regardless, thank you for checking this request out. I'm happy to hear that this is something we do want to support.

@wild-endeavor wild-endeavor added flyteidl flytekit FlyteKit Python related issue flytepropeller and removed untriaged This issues has not yet been looked at by the Maintainers labels Jan 4, 2022
@wild-endeavor
Contributor

this is at least partially implemented with flyteorg/flytekit#785

@wild-endeavor
Contributor

closing this issue in favor of structured datasets

eapolinario pushed a commit to eapolinario/flyte that referenced this issue Dec 20, 2022
* fix broken links, content clean-up

Signed-off-by: Samhita Alla <[email protected]>

* nit

Signed-off-by: Samhita Alla <[email protected]>

* incorporate suggestions @cosmicBboy

Signed-off-by: Samhita Alla <[email protected]>

* isort

Signed-off-by: Samhita Alla <[email protected]>

3 participants