Leverage pandas' ExtensionDtype for defining efficient new types #76

Open
sbrugman opened this issue May 22, 2020 · 3 comments
Labels
enhancement New feature or request

Comments

@sbrugman
Contributor

sbrugman commented May 22, 2020

Visions currently supports defining custom types, such as Path, File and URL. These types inherit from object and are stored as uniquely defined classes. For instance, this means that a URL is stored as the namedtuple ParseResult that is returned by urlparse.
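As a small illustration of the storage described above (standard library only; the example URL is just a placeholder), each parsed URL is a full Python object rather than a value in a typed array:

```python
from urllib.parse import ParseResult, urlparse

# Parse a single URL the way urlparse does for any string input
parsed = urlparse("https://github.com/ContinuumIO/cyberpandas")

# ParseResult is a namedtuple subclass, so a series of these values
# can only be held in pandas with the generic object dtype
is_namedtuple = isinstance(parsed, ParseResult) and isinstance(parsed, tuple)
host = parsed.netloc
path = parsed.path
```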

This strategy is effective in applications where the series would have been converted to the object dtype anyway, and it poses no problem for small to medium-sized datasets. For larger datasets we should consider an additional strategy, in which a new (d)type is created as an alias for an existing pandas dtype. Allowing for this kind of abstraction addresses one of the major shortcomings in pandas at the moment. Custom dtypes generally reduce memory usage, and they reduce the computational complexity of membership checks from O(n) to O(1), since the type can be read from the series dtype instead of being checked element by element. The same functionality could be maintained through an accessor (series.path, just like series.dt).
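A rough sketch of what such an accessor could look like, using pandas' `register_series_accessor`. The `PathAccessor` class and its properties are hypothetical and not part of visions or pandas; the paths are stored as plain strings and only parsed when a property is requested:

```python
from pathlib import PurePosixPath

import pandas as pd


@pd.api.extensions.register_series_accessor("path")
class PathAccessor:
    """Hypothetical accessor exposing path properties on a string series."""

    def __init__(self, series: pd.Series) -> None:
        self._series = series

    @property
    def suffix(self) -> pd.Series:
        # Parse lazily; the underlying series keeps its compact storage
        return self._series.map(lambda p: PurePosixPath(p).suffix)

    @property
    def name(self) -> pd.Series:
        return self._series.map(lambda p: PurePosixPath(p).name)


series = pd.Series(["/data/a.csv", "/data/b.parquet"])
suffixes = series.path.suffix.tolist()
```

Here `series.path.suffix` plays the same role as `series.dt.year` does for datetimes.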

Two implementation considerations:

  • pandas' StringDtype and ExtensionDtype are experimental and may change. The code for this enhancement should therefore be a minimal layer over the pandas interface.
  • StringDtype was introduced in pandas v1.0.0. ExtensionDtype, however, was introduced earlier. Visions should provide backwards compatibility.
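One way the backwards-compatibility point could be handled is feature detection rather than version parsing. This is only a sketch under that assumption; `text_dtype` is a hypothetical helper, not something from visions or the linked PR:

```python
import pandas as pd


def text_dtype():
    """Return StringDtype when this pandas version has it, else object."""
    # StringDtype exists from pandas 1.0.0; older versions fall back to object
    if hasattr(pd, "StringDtype"):
        return pd.StringDtype()
    return object


series = pd.Series(["/data/a.csv", "/data/b.csv"], dtype=text_dtype())

# True when the series is backed by an extension array, False on the fallback
uses_extension = pd.api.types.is_extension_array_dtype(series.dtype)
```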

A type-agnostic solution is proposed in the linked PR.

@sbrugman sbrugman added the enhancement New feature or request label May 22, 2020
@sbrugman
Contributor Author

sbrugman commented May 22, 2020

@jamesmyatt

Might be worth looking at cyberpandas (https://github.com/ContinuumIO/cyberpandas), which implements an IPAddress extension array.

@sbrugman
Contributor Author

@jamesmyatt Thanks for thinking along! cyberpandas is an excellent demonstration of how adding new types can be useful. On the other hand, it demonstrates how involved adding a type can get with pandas. The pandas devs are (currently) not keen on supporting subclassing of other ExtensionDtypes.