types use new for generics. no metaprogramming #4016

Merged
merged 16 commits into from
Feb 22, 2022

CircArgs (Contributor) commented Jan 31, 2022

This PR is an alternative to #3981, which sought to use a pattern similar to Python's typing generics (e.g. List, Dict, Union) to create types that could be used inside Iceberg and also serve as types that could be statically checked when used to annotate code.

After discussions with @samredai and @rdblue, I've revised it further so that there is no real metaprogramming, yet we still get much of the value.

The same syntax as the current code is used to create types:

IntegerType(),
StructType(
    [
        NestedField(True, 1, "required_field", StringType()),
        NestedField(False, 2, "optional_field", IntegerType()),
    ]
)

yet we get == for free (no dedicated __eq__ methods) and can use isinstance to check types instead of issubclass as was the case in #3981.

Take this example:

>>> FixedType(length=8) is FixedType(length=8) # same object in memory
True

>>> str(IntegerType())
integer

>>> IntegerType() is IntegerType() # same object in memory
True

>>> repr(BooleanType())
BooleanType()

>>> repr(StructType(
        [
            NestedField(True, 1, "required_field", StringType()),
            NestedField(False, 2, "optional_field", IntegerType()),
        ]
    ))
StructType(fields=(NestedField(is_optional=True, field_id=1, name='required_field', field_type=StringType(), doc=None), NestedField(is_optional=False, field_id=2, name='optional_field', field_type=IntegerType(), doc=None)))

>>> str(StructType(
        [
            NestedField(True, 1, "required_field", StringType()),
            NestedField(False, 2, "optional_field", IntegerType()),
        ]
    ))
struct<[nestedfield<True, 1, required_field, string, None>, nestedfield<False, 2, optional_field, integer, None>]>

>>> StructType(
        [
            NestedField(True, 1, "required_field", StringType()),
            NestedField(False, 2, "optional_field", IntegerType()),
        ]
    )==StructType(
        [
            NestedField(True, 1, "required_field", StringType()),
            NestedField(False, 2, "optional_field", IntegerType()),
        ]
    )
True 

>>> StructType(
        [
            NestedField(True, 1, "required_field", StringType()),
            NestedField(False, 2, "optional_field", IntegerType()),
        ]
    )==StructType(
        [
            NestedField(True, 0, "required_field", StringType()), # id changed from 1 to 0
            NestedField(False, 2, "optional_field", IntegerType()),
        ]
    ) 
False 

>>> isinstance(StringType(), StringType)
True

This types.py is about 100 lines shorter than the current one and offers more functionality, as described above.

The centerpiece of this PR is simply the __new__ method on the base IcebergType, which checks the attribute _implemented: Dict[Tuple[str, Tuple[Any]], "IcebergType"]. As the annotation shows, it keeps track of IcebergType instances, keyed by the type's name and its attributes (as defined in __init__).
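
The caching idea described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the code from this PR: `CachedType`, `IntegerLike`, and `FixedLike` are hypothetical names, and the key is built from the class name plus the positional constructor arguments.

```python
from typing import Any, Dict, Tuple


class CachedType:
    """Hypothetical stand-in for IcebergType (not the PR's implementation)."""

    # Maps (class name, constructor arguments) to the single shared instance.
    _implemented: Dict[Tuple[str, Tuple[Any, ...]], "CachedType"] = {}

    def __new__(cls, *args: Any):
        key = (cls.__name__, args)
        if key not in cls._implemented:
            # Allocate only on a cache miss; later calls reuse this object.
            cls._implemented[key] = super().__new__(cls)
        return cls._implemented[key]


class IntegerLike(CachedType):
    pass


class FixedLike(CachedType):
    def __init__(self, length: int):
        self.length = length


same_fixed = FixedLike(8) is FixedLike(8)      # same arguments, same object
same_int = IntegerLike() is IntegerLike()      # no arguments, same object
different = FixedLike(8) is FixedLike(16)      # different arguments, different objects
```

Because equal arguments always yield the same object, the default identity-based `==` already behaves like structural equality, which is why no dedicated `__eq__` is needed in this sketch.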

Thank you to @samredai and @rdblue for helping to inspire this change.

Note: if this PR is accepted #3981 should be closed

_implemented = {} # type: ignore

def __new__(cls):
Collaborator

Wondering if it is better to move this logic into a Singleton class and any class can extend it to be a singleton.

Contributor Author

What do you think the benefits of that might be?

Contributor

Possible reuse later.

Contributor

I'd probably continue to use Singleton so that you don't have to call object.__new__ and can use super().__new__ instead. That seems safer to me for some reason.
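
For illustration, a reusable base class along the lines suggested in this thread might look like the sketch below. The class names are hypothetical and this is not necessarily how the Singleton referred to here is written; it only shows `super().__new__` being used instead of `object.__new__`.

```python
class Singleton:
    """Hypothetical reusable base: each subclass gets at most one instance."""

    _instance = None

    def __new__(cls):
        # Look in cls.__dict__ so a subclass never reuses its parent's instance.
        if cls.__dict__.get("_instance") is None:
            cls._instance = super().__new__(cls)
        return cls._instance


class BooleanLike(Singleton):
    pass


class StringLike(Singleton):
    pass


bool_a = BooleanLike()
bool_b = BooleanLike()
string_a = StringLike()
```

Here `bool_a` and `bool_b` are the same object, while `string_a` is a separate instance of its own class.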

if cls in cls._implemented:
    return cls._implemented[cls]
else:
    ret = object.__new__(cls)
Collaborator

Do we need to keep a set here? Seems just _instance = None should be OK, right?

Contributor Author

It's a dictionary, not a set. I put it there so that get would work; otherwise I'd have to check for None first, right?

    doc: Optional[str] = None,
):
    key = is_optional, field_id, name, field_type, doc
    cls._implemented[key] = cls._implemented.get(key, object.__new__(cls))
Collaborator

Won't this line always create a new object with object.__new__(cls) and pass it to the get call, even when the key is already cached?

Contributor Author

Yeah, you're right. That was unnecessarily inefficient. I've changed it to use the or operator instead.
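
The difference discussed in this thread can be seen in a small standalone sketch. The `make` helper and the cache key are hypothetical; the point is only that the default passed to `dict.get` is evaluated before `get` runs, while `or` short-circuits.

```python
calls = {"eager": 0, "lazy": 0}


def make(label: str) -> object:
    """Hypothetical stand-in for object.__new__(cls) that counts its calls."""
    calls[label] += 1
    return object()


key = ("decimal", 9, 2)

eager_cache = {}
for _ in range(3):
    # The default argument is built before get() is called, so make() runs every time.
    eager_cache[key] = eager_cache.get(key, make("eager"))

lazy_cache = {}
for _ in range(3):
    # `or` short-circuits, so make() only runs when get() returns None (a miss).
    lazy_cache[key] = lazy_cache.get(key) or make("lazy")
```

After the loops, `calls` records three allocations for the eager form and one for the lazy form. One caveat with `or`: it would also allocate again if a cached object were falsy, so it relies on instances being truthy.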

rdblue (Contributor) commented Feb 2, 2022

This is looking good, but I just committed @jun-he's change that adds eq since it was simple and unblocks him. Can you rebase and merge in those changes? That has a Singleton class that can be used in a couple places.

CircArgs and others added 3 commits February 3, 2022 09:00
add missing docstrings; comments in __new__

icebergtype back to type; new on all classes; manual str, repr

remove instantiating a new object each check in __new__

implemented to instances

Type back to IcebergType

add whitespace before/after Example in docstrings
samredai (Collaborator) commented Feb 3, 2022

I took a look at this test failure and it looks like it might be transient; I suspect rerunning the 3.8 test will succeed. That said, I noticed that --diff is not included, so the logs don't show which lines are causing the lint failure. I opened PR #4034 to add that.

CircArgs (Contributor Author) commented Feb 3, 2022

@rdblue Merged in those changes. I think it's in line with the discussion @jun-he and I were having in his PR. Singleton is as it was in his PR.

@CircArgs CircArgs requested a review from rdblue February 4, 2022 17:07

def __new__(cls, precision: int, scale: int):
    key = precision, scale
    cls._instances[key] = cls._instances.get(key) or object.__new__(cls)
Contributor

Doesn't this need to pass precision and scale in for __init__?

Contributor Author

It's not in __new__'s semantics to deal with initialization. It could be done, but I think that's bad practice, so it's left to __init__.

rdblue (Contributor), Feb 15, 2022

Okay, I understand. So Python is still going to call __init__ after this method on the object that this returns?

Is it a concern that __init__ is called every time this is returned?


class NestedField(IcebergType):
"""equivalent of `NestedField` type from Java implementation"""
rdblue (Contributor), Feb 6, 2022

Doc strings should use normal sentence case, so this should start with a capital letter.

Also, I don't think that we want to refer to the Java implementation. This represents a field of a struct, a map key, a map value, or a list element. This is where field IDs, names, docs, and nullability are tracked.

 class StructType(Type):
-    def __init__(self, fields: list):
+    def __new__(cls, fields: List[NestedField]):
Contributor

Is it possible to change this to *fields instead of List[NestedField]? That would allow a bit more natural syntax:

s: StructType = StructType(NestedField(True, 1, "col", StringType()))

Contributor Author

sure is. looks much better this way. all set now
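
A rough sketch of the variadic signature discussed in this thread, using hypothetical `FieldLike` and `StructLike` classes rather than the real NestedField and StructType:

```python
from typing import NamedTuple, Tuple


class FieldLike(NamedTuple):
    """Hypothetical, simplified stand-in for NestedField."""

    is_optional: bool
    field_id: int
    name: str


class StructLike:
    """Hypothetical stand-in for StructType that accepts fields variadically."""

    def __init__(self, *fields: FieldLike):
        # *fields arrives as a tuple, so it is already immutable and hashable.
        self.fields: Tuple[FieldLike, ...] = fields


# Fields are passed directly, without wrapping them in a list.
struct = StructLike(FieldLike(True, 1, "col"), FieldLike(False, 2, "other"))

# An existing list can still be used by unpacking it at the call site.
existing = [FieldLike(True, 1, "col"), FieldLike(False, 2, "other")]
from_list = StructLike(*existing)
```

Receiving the fields as a tuple also makes them convenient to use as part of a cache key.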

@CircArgs CircArgs requested a review from rdblue February 7, 2022 16:21
@CircArgs CircArgs requested a review from jun-he February 14, 2022 15:10

class NestedField(IcebergType):
"""This represents a field of a struct, a map key, a map value, or a list element. This is where field IDs, names, docs, and nullability are tracked."""
Contributor

This doc string should be wrapped at 130 characters, right?


Example:
>>> MapType(key_id=1, key_type=StringType(), value_id=2, value_type=IntegerType(), value_is_optional=True)
MapType(key=NestedField(is_optional=False, field_id=1, name='key', field_type=StringType(), doc=None), value=NestedField(is_optional=True, field_id=2, name='value', field_type=IntegerType(), doc=None))
Contributor

Looks like this also needs to be wrapped to the line length.

):
    super().__init__(
        f"list<{element_type}>",
        f"ListType(element_is_optional={element_is_optional}, element_id={element_id}, "
Contributor

Minor: It would be nice if the argument order were consistent. This is slightly misleading because element_is_optional is last if you're passing by position.

_instances: Dict[Tuple[bool, int, str, IcebergType, Optional[str]], "NestedField"] = {}

def __new__(
Contributor

Testing this out in ipython, I missed a couple args and got this error message:

__new__() missing 2 required positional arguments: 'name' and 'field_type'

Is there a way to improve that? Or is that helpful enough?

rdblue (Contributor) commented Feb 15, 2022

Thanks, @CircArgs! Just a couple minor fixes to go and then we can get this in. The only semi-major thing is my question about __init__: is that called every time we return a type? Or does Python keep track and only call it once for the object?

CircArgs (Contributor Author)

Hey @rdblue, __init__ will be called every time we request an instance of the type. First __new__ gives us the instance (via either the custom __new__ on the class or the one inherited from Singleton), and then __init__ is called. This is true for both the parameterized types and the non-parameterized ones.
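
The behavior described in this answer can be checked with a small standalone sketch. `CountingType` is a hypothetical class, not one from this PR; it caches instances in `__new__` and counts how often `__init__` runs.

```python
class CountingType:
    """Hypothetical cached type that records every __init__ call."""

    _instances = {}
    init_calls = 0

    def __new__(cls, length: int):
        # Reuse the cached instance for this length, allocating only on a miss.
        cls._instances[length] = cls._instances.get(length) or super().__new__(cls)
        return cls._instances[length]

    def __init__(self, length: int):
        # Python calls __init__ on whatever __new__ returns, as long as it is
        # an instance of cls, so this also runs for cached objects.
        CountingType.init_calls += 1
        self.length = length


first = CountingType(8)
second = CountingType(8)
```

Both names refer to the same object, yet `init_calls` ends up at 2. If re-initialization is a concern, `__init__` needs to be idempotent or guarded in some way.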

CircArgs (Contributor Author)

> @CircArgs, FYI. I tried to open a PR against your branch, but I couldn't so I had to create a new PR.
>
> Feel free to pick the changes into your branch if you want to commit the other PR.

Thanks a lot, @rdblue. I've pulled in your commits here. Sorry I wasn't quite on the same page with what you were thinking; it makes sense with the flag in your changes.

samredai (Collaborator)

I commented in the other PR and I see those commits were added here. This LGTM. Thanks @CircArgs and @rdblue!

@rdblue rdblue merged commit 818d9a5 into apache:master Feb 22, 2022
rdblue (Contributor) commented Feb 22, 2022

Thanks, @CircArgs! Great to have this in. @jun-he, can you update the transforms PR? I think that one is next.

arminnajafi pushed a commit to arminnajafi/iceberg that referenced this pull request Feb 23, 2022