Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-134 Cannot encode pandas NA objects #118

Merged
merged 6 commits into from
Jan 11, 2023

Conversation

blink1073
Copy link
Member

No description provided.

bindings/python/pymongoarrow/api.py Outdated Show resolved Hide resolved
"""A custom type codec for Pandas NA objects."""

python_type = NA.__class__ # type:ignore[assignment]
bson_type = None # type:ignore[assignment]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, why the type ignores?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My editor had flagged them because the base class uses read-only properties.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see mypy does not raise this issue. This does look like a bug in pyright though:

  /Users/shane/git/mongo-python-driver/ignore.py:15:19 - error: Expression of type "python_type" cannot be assigned to declared type "property"
    "Type[type]" is incompatible with "Type[property]" (reportGeneralTypeIssues)

Looks like microsoft/pyright#2601 which they closed as won't fix.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually that pyright issue has @classmethod whereas we just use @property so I think it would be worth opening a new pyright issue for it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

codec_options = collection.codec_options
if DataFrame is not None:
type_registry = TypeRegistry([_PandasNACodec()])
codec_options = codec_options.with_options(type_registry=type_registry)
Copy link
Collaborator

@ShaneHarvey ShaneHarvey Dec 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it's ideal to completely replace the collection's type_registry. What if the app already configured a type_registry for other types they want to encode?

Would it be possible to keep the existing registry but add the _PandasNACodec?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but we'd have to use private APIs from TypeRegistry.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah my mistake for assuming TypeRegistry would have the ability to add/edit a type. You know, like a registry. Could you open an ARROW ticket for this and backlog it? I'd say it's low priority.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

codec_options = collection.codec_options
if DataFrame is not None:
type_registry = TypeRegistry([_PandasNACodec()])
codec_options = codec_options.with_options(type_registry=type_registry)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah my mistake for assuming TypeRegistry would have the ability to add/edit a type. You know, like a registry. Could you open an ARROW ticket for this and backlog it? I'd say it's low priority.

"""A custom type codec for Pandas NA objects."""

python_type = NA.__class__ # type:ignore[assignment]
bson_type = None # type:ignore[assignment]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see mypy does not raise this issue. This does look like a bug in pyright though:

  /Users/shane/git/mongo-python-driver/ignore.py:15:19 - error: Expression of type "python_type" cannot be assigned to declared type "property"
    "Type[type]" is incompatible with "Type[property]" (reportGeneralTypeIssues)

Looks like microsoft/pyright#2601 which they closed as won't fix.

"""A custom type codec for Pandas NA objects."""

python_type = NA.__class__ # type:ignore[assignment]
bson_type = None # type:ignore[assignment]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually that pyright issue has @classmethod whereas we just use @property so I think it would be worth opening a new pyright issue for it.

val = outgoing[name]
if str(val.dtype) in ["object", "float64"]:
val = val.astype(col.dtype)
pd.testing.assert_series_equal(col, val)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is using "val" as the name of a column idomatic?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

for name in incoming.columns:
col = incoming[name]
val = outgoing[name]
if str(val.dtype) in ["object", "float64"]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment here or a method docstring would be helpful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return datetime.datetime(1970, 1, 1)
return k.replace(tzinfo=None)

data["datetime"] = data.apply(lambda row: new_replace(row["datetime"]), axis=1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That pandas ticket seems to be about .time() not working on NA types but we don't use .time() anywhere. Can you explain?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, meant to link to pandas-dev/pandas#16248. Also added context.

Copy link
Collaborator

@ShaneHarvey ShaneHarvey Dec 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still confused since this code is changing the data before passing it to any pymongoarrow methods. What happens if the user actually calls write() with a NaT datetime? And why are we clamping the datetime to datetime.datetime(1970, 1, 1)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They would see the same error as https://jira.mongodb.org/browse/FREE-165786, for which this is a workaround. I just chose a random valid date.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, 2 follow up questions.

  1. Can we change this to something like?:
if isinstance(k, Nat):
   return None
return k
  1. this looks like something we should ideally be working around in write() itself. Can you open a ticket for making write() encode NaT datetimes as BSON null?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we return None we get the original error, so I had to return the workaround datetime object. I filed https://jira.mongodb.org/browse/ARROW-136.

@blink1073 blink1073 merged commit 60e7706 into mongodb-labs:main Jan 11, 2023
@blink1073 blink1073 deleted the ARROW-134 branch January 11, 2023 21:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants