Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

generics and associated tests - no literals #3981

Closed
wants to merge 2 commits into from
Closed

generics and associated tests - no literals #3981

wants to merge 2 commits into from

Conversation

CircArgs
Copy link
Contributor

@CircArgs CircArgs commented Jan 25, 2022

This PR is in contrast to #3952 which saw an attempt to use optional arguments to get a semblance of combined type/literal functionality. This one steps back to offer true generics e.g. FixedType[8] as opposed to FixedType(length=8) with all the intuitive functionality that comes from using true types and not class instances.

Examples:

IntegerType==IntegerType
DecimalType[32, 3]==DecimalType[32, 3]
StructType[
        NestedField[DecimalType[32, 3], 0, "c1", True],
        NestedField[FloatType, 1, "c2", False],
    ] ==
StructType[
        NestedField[DecimalType[32, 3], 0, "c1", True],
        NestedField[FloatType, 1, "c2", False],
    ]
>>> issubclass(NestedField[DecimalType[32, 3], 0, "c1", True], NestedField)
True

>>> issubclass(StructType[
        NestedField[DecimalType[32, 3], 0, "c1", True],
        NestedField[FloatType, 1, "c2", False],
    ], StructType[
        NestedField[DoubleType, 0, "c1", True],
        NestedField[FloatType, 1, "c2", False],
    ])
False

>>>issubclass(StructType[
        NestedField[DecimalType[32, 3], 0, "c1", True],
        NestedField[FloatType, 1, "c2", False],
    ], StructType[
        NestedField[IcebergType, 0, "c1", True],
        NestedField[FloatType, 1, "c2", False],
    ])
True

It provides the functionality to create further types possibly necessary in the future such as the transforms. For example,

>>> BucketTransform = generic_class(
    "BucketTransform",
   {
            "type": Union[
                DateType,
                TimeType,
                TimestampType,
                TimestamptzType,
                StringType,
                BinaryType,
                UUIDType,
                NumberType,
                FixedType,
            ],
        "num_buckets", int})

>>>BucketTransform[IntegerType, 10]
BucketTransform[type=IntegerType, num_buckets=10]


"""Data types used in describing Iceberg schemas
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add the newline back so git doesn't think this line changed?

@emkornfield
Copy link
Contributor

While this seems like cool functionality, I'm a bit ambivalent of conflating the iceberg type system with the python type system, I can't think of any immediate drawbacks other then it isn't clear if the issubclass relation is really the right relation to be using between types. For instance the last example:

>>>issubclass(StructType[
        NestedField[DecimalType[32, 3], 0, "c1", True],
        NestedField[FloatType, 1, "c2", False],
    ], StructType[
        NestedField[IcebergType, 0, "c1", True],
        NestedField[FloatType, 1, "c2", False],
    ])
True

strikes me as not immediately what the semantics of a StructType inheritance should be and if inheritance in this case really makes sense?

Is there a design doc someplace that discusses this approach? Are there other projects that have gone this route that have had good results? Does this make important constructs much easier to express?

@rdblue
Copy link
Contributor

rdblue commented Jan 29, 2022

@emkornfield, I just had a discussion with @CircArgs about this yesterday. I think that we're currently evaluating whether integration with Python's type system is necessary and worth the additional complexity. It seems like the main benefit is to be able to have functions like transforms that accept a certain argument type and to make the literals conform to that type. But given how we will process actual data (and not just literal values in in expressions) won't use type safety because the data is in a dataframe or some direct representation (like a python int rather than a wrapper around int), it seems fairly limited.

I think we may want to go with a simpler type implementation, but @CircArgs was going to give it some thought and weigh in.

@emkornfield
Copy link
Contributor

@rdblue thanks for the update. Was this conversation on slack or offline. I'm trying to ramp up on the with the community is doing and trying to figure out which communication channels are critical to monitor?

@rdblue
Copy link
Contributor

rdblue commented Jan 30, 2022

@emkornfield, we had an offline chat because I didn't understand the goal of this approach at first. Most of the time, we coordinate through PRs and the #python channel in the apache-iceberg Slack community.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants