Support for API Extensions #1465

Closed · scook12 opened this issue May 6, 2020 · 5 comments · Fixed by #1617
Labels: discussions, enhancement (New feature or request)

Comments

@scook12 (Contributor) commented May 6, 2020

Issue

pandas exposes a pretty simple API that lets library developers extend pandas objects by registering custom accessors. It would be awesome if koalas supported a similar feature.

Resources

Docs: https://pandas.pydata.org/pandas-docs/stable/reference/extensions.html

Public API: https://github.com/pandas-dev/pandas/blob/master/pandas/api/extensions/__init__.py

Accessors: https://github.com/pandas-dev/pandas/blob/master/pandas/core/accessor.py

#420 has a possibly related discussion on pandas extension dtypes.
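
For context, here is roughly what the registration pattern looks like on the pandas side (a minimal sketch of the documented pattern; the 'geo' accessor name and the lat/lon columns are just illustrative):

import pandas as pd
from pandas.api.extensions import register_dataframe_accessor


@register_dataframe_accessor("geo")
class GeoAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    @property
    def center(self):
        # Mean of the latitude/longitude columns of this frame.
        return self._obj.lat.mean(), self._obj.lon.mean()


df = pd.DataFrame({"lat": [10.0, 20.0], "lon": [100.0, 120.0]})
print(df.geo.center)  # (15.0, 110.0)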

HyukjinKwon added the discussions and enhancement labels on May 7, 2020
@achapkowski commented

For PySpark DataFrames, you can register custom accessors by doing the following. If this gets added to core, I believe it will work here too.

import warnings
from functools import wraps


class CachedAccessor:
    """
    Custom property-like object (descriptor) for caching accessors.

    Parameters
    ----------
    name : str
        The namespace this will be accessed under, e.g. ``df.foo``
    accessor : cls
        The class with the extension methods.

    NOTE
    ----
    Modified based on pandas.core.accessor.
    """

    def __init__(self, name, accessor):
        self._name = name
        self._accessor = accessor

    def __get__(self, obj, cls):
        if obj is None:
            # we're accessing the attribute of the class, i.e., Dataset.geo
            return self._accessor
        accessor_obj = self._accessor(obj)
        # Replace the property with the accessor object. Inspired by:
        # http://www.pydanny.com/cached-property.html
        setattr(obj, self._name, accessor_obj)
        return accessor_obj


def _register_accessor(name, cls):
    """
    NOTE
    ----
    Modified based on pandas.core.accessor.
    """

    def decorator(accessor):
        if hasattr(cls, name):
            warnings.warn(
                "registration of accessor {!r} under name {!r} for type "
                "{!r} is overriding a preexisting attribute with the same "
                "name.".format(accessor, name, cls),
                UserWarning,
                stacklevel=2,
            )
        setattr(cls, name, CachedAccessor(name, accessor))
        return accessor

    return decorator


def register_dataframe_accessor(name):
    """
    NOTE
    ----
    Modified based on pandas.core.accessor.
    """
    try:
        from pyspark.sql import DataFrame
    except ImportError:
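        # NOTE: import_message is an external helper (not defined in this
        # snippet) that tells the user how to install the missing dependency.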
        import_message(
            submodule="spark",
            package="pyspark",
            conda_channel="conda-forge",
            pip_install=True,
        )

    return _register_accessor(name, DataFrame)


def register_dataframe_method(method):
    """Register a function as a method attached to the Pyspark DataFrame.

    NOTE
    ----
    Modified based on pandas_flavor.register.
    """

    def inner(*args, **kwargs):
        class AccessorMethod:
            def __init__(self, pyspark_obj):
                self._obj = pyspark_obj

            @wraps(method)
            def __call__(self, *args, **kwargs):
                return method(self._obj, *args, **kwargs)

        register_dataframe_accessor(method.__name__)(AccessorMethod)

        return method

    return inner()

Then register an accessor with it:

@register_dataframe_accessor('amazingtimes')
class AmazingNameDataFrameAccessor:
    def __init__(self, data):
        self._data = data
        print('foo')

    @property
    def hello(self):
        return 'pyspark accessor'

    def method(self, a=1):
        """this is a method example"""
        a += a
        return a

    @property
    def columns(self):
        return self._data.schema.names

Usage

print(df.amazingtimes.hello)
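
The accessor's methods and properties work the same way, and register_dataframe_method covers the single-function case. A quick sketch (with_greeting is a made-up example; df is any PySpark DataFrame):

from pyspark.sql import functions as F

print(df.amazingtimes.method(a=2))  # 4
print(df.amazingtimes.columns)      # the DataFrame's column names


@register_dataframe_method
def with_greeting(df, greeting='hi'):
    """Attach a constant greeting column (hypothetical example)."""
    return df.withColumn('greeting', F.lit(greeting))


df.with_greeting('hello').show()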

@HyukjinKwon (Member) commented

Seems pretty good. @achapkowski, are you interested in opening a PR? Let's make sure the documentation and usage are similar to, or the same as, pandas'.

@scook12 (Contributor, Author) commented Jun 11, 2020

@achapkowski let me know if you're going to take this on - if not, I can take a look at it next week.

@HyukjinKwon (Member) commented

Please go ahead @scook12!

@scook12 (Contributor, Author) commented Jun 20, 2020

Thanks @HyukjinKwon!

HyukjinKwon pushed a commit that referenced this issue Jul 1, 2020