Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible nbstripout-fast integration #179

Open
mlucool opened this issue Dec 10, 2022 · 6 comments
Open

Possible nbstripout-fast integration #179

mlucool opened this issue Dec 10, 2022 · 6 comments

Comments

@mlucool
Copy link

mlucool commented Dec 10, 2022

Coming from #33 (comment).

Hi,

This is a really excellent project, but like above I found it had too much overhead on larger repos. I wrote a pure rust version (with python bindings so it can be pip installed) located at https://github.com/deshaw/nbstripout-fast (happed to chose the same name as @stas00). My testing shows this is ~200x faster. Really intersted to hear your thoughts @kynan. This is not a true a 1:1 match, but all key features should be included and a few more added.

Is there a way in which this project could let users choose to use the rust version? Your install/setups scripts are great and clearly this project is very very popular. I think some sort of linkage there would be a net positive for the community.

@kynan

Is there a way in which this project could let users choose to use the rust version?

In theory yes, however doesn't it feel a little odd to have one tool install another?

Can you give more detail on what exactly you're thinking of? How do you see nbstripouts installation work when choosing nbstripout-fast ? I also think this discussion should be moved to a separate issue.

Rust is not ideal for everyone because 1) we can't have wheels for every OS and 2) it's harder to tinker with. I was thinking something along the lines of if this project added nbstripout-fast as an optional dependency that can gracefully fail then IFF nbstripout-fast installed and the options you want to use overlap (most do), it calls nbstripout-fast, if not it continues to call the python code (of course we can add a flag to force one or the other).

Mostly I just wanted to put the thought out there and see if there was any traction. I totally understand if this does not seem like something you want to do for this project as it's not a super clean design. Open to other ideas as well.

@arobrien
Copy link
Contributor

A few projects have optional dependencies on accelerator libraries. For example GeoPandas will use pygeos if available, otherwise fallback to alternative implementations (with configuration overrides). I think this serves as a good model for how this speedup can be made available to users who need it, without fracturing the ecosystem.

# _compat.py

INSTALL_PYGEOS_ERROR = "To use PyGEOS within GeoPandas, you need to install PyGEOS: \
'conda install pygeos' or 'pip install pygeos'"

try:
    import pygeos  # noqa

    # only automatically use pygeos if version is high enough
    if Version(pygeos.__version__) >= Version("0.8"):
        HAS_PYGEOS = True
        PYGEOS_GE_09 = Version(pygeos.__version__) >= Version("0.9")
        PYGEOS_GE_010 = Version(pygeos.__version__) >= Version("0.10")
    else:
        warnings.warn(
            "The installed version of PyGEOS is too old ({0} installed, 0.8 required),"
            " and thus GeoPandas will not use PyGEOS.".format(pygeos.__version__),
            UserWarning,
        )
        HAS_PYGEOS = False
except ImportError:
    HAS_PYGEOS = False


def set_use_pygeos(val=None):
    """
    Set the global configuration on whether to use PyGEOS or not.
    The default is use PyGEOS if it is installed. This can be overridden
    with an environment variable USE_PYGEOS (this is only checked at
    first import, cannot be changed during interactive session).
    Alternatively, pass a value here to force a True/False value.
    """
    global USE_PYGEOS
    global USE_SHAPELY_20
    global PYGEOS_SHAPELY_COMPAT

    env_use_pygeos = os.getenv("USE_PYGEOS", None)

    if val is not None:
        USE_PYGEOS = bool(val)
    else:
        if USE_PYGEOS is None:

            USE_PYGEOS = HAS_PYGEOS

            if env_use_pygeos is not None:
                USE_PYGEOS = bool(int(env_use_pygeos))

    # validate the pygeos version
    if USE_PYGEOS:
        try:
            import pygeos  # noqa

            # validate the pygeos version
            if not Version(pygeos.__version__) >= Version("0.8"):
                if SHAPELY_GE_20:
                    USE_PYGEOS = False
                    warnings.warn(
                        "The PyGEOS version is too old, and Shapely >= 2 is installed, "
                        "thus using Shapely by default and not PyGEOS."
                    )
                else:
                    raise ImportError(
                        "PyGEOS >= 0.8 is required, version {0} is installed".format(
                            pygeos.__version__
                        )
                    )

            # Check whether Shapely and PyGEOS use the same GEOS version.
            # Based on PyGEOS from_shapely implementation.

            from shapely.geos import geos_version_string as shapely_geos_version
            from pygeos import geos_capi_version_string

            # shapely has something like: "3.6.2-CAPI-1.10.2 4d2925d6"
            # pygeos has something like: "3.6.2-CAPI-1.10.2"
            if not shapely_geos_version.startswith(geos_capi_version_string):
                warnings.warn(
                    "The Shapely GEOS version ({}) is incompatible with the GEOS "
                    "version PyGEOS was compiled with ({}). Conversions between both "
                    "will be slow.".format(
                        shapely_geos_version, geos_capi_version_string
                    )
                )
                PYGEOS_SHAPELY_COMPAT = False
            else:
                PYGEOS_SHAPELY_COMPAT = True

        except ImportError:
            raise ImportError(INSTALL_PYGEOS_ERROR)

    if USE_PYGEOS and env_use_pygeos is None and SHAPELY_GE_20:
        warnings.warn(
            "Shapely 2.0 is installed, but because PyGEOS is also installed, "
            "GeoPandas will still use PyGEOS by default for now. To force to use and "
            "test Shapely 2.0, you have to set the environment variable USE_PYGEOS=0. "
            "You can do this before starting the Python process, or in your code "
            "before importing geopandas:"
            "\n\nimport os\nos.environ['USE_PYGEOS'] = '0'\nimport geopandas\n\n"
            "In a future release, GeoPandas will switch to using Shapely by default. "
            "If you are using PyGEOS directly (calling PyGEOS functions on geometries "
            "from GeoPandas), this will then stop working and you are encouraged to "
            "migrate from PyGEOS to Shapely 2.0 "
            "(https://shapely.readthedocs.io/en/latest/migration_pygeos.html).",
            stacklevel=6,
        )

    USE_SHAPELY_20 = (not USE_PYGEOS) and SHAPELY_GE_20

set_use_pygeos()

I think that the (having not done any profiling of my own....) core slowdowns are likely happening in the file read/write, and in the strip_output() method (along with the equivalent zeppelin method. strip_output() isn't really too complicated, so maintaining two parallel implementations wouldn't be too hard, given a robust enough test suite. I want to restructure the code to be a little bit more modular in the area, including creating a StripOptions class or similar, anyway, which would facilitate a clearer and more robust interface between these two projects.

@mlucool
Copy link
Author

mlucool commented Dec 11, 2022

I think that the (having not done any profiling of my own....) core slowdowns are likely happening in the file read/write, and in the strip_output() method

My testing showed even starting python was too slow if it needed to be done 100s of times (once per file). That being said, there is also a git protocol to start the binary once and reuse that proces; I could see this python overhead time being negligible then*. Doing this easily depends on maturin support for shipping both python and bin (PyO3/maturin#368), calling nbstripout-fast from python is possible. Today it's setup to use python bindings during unit testing and creating a bin with wheels for release.

*This may speedup nbstripout without nbstripout-fast

@kynan
Copy link
Owner

kynan commented Dec 11, 2022

I think to automatically select nbstripout-fast when installed we'd need to ensure that both really are compatible and nbstripout's tests aren't nearly robust enough to make this possible.

I'd prefer to go about this the other way around: how about we refactor the installation commands to be modular enough so nbstripout-fast can reuse them without reinventing the wheel? Would that work for you?

@mlucool
Copy link
Author

mlucool commented Dec 11, 2022

how about we refactor the installation commands to be modular enough so nbstripout-fast can reuse them without reinventing the wheel? Would that work for you?

That seems like a good starting point given where the projects are at now

@kynan
Copy link
Owner

kynan commented Dec 12, 2022

If you want to contribute to this refactoring let me know. Might be able to find some time over the holiday break otherwise.

@mlucool
Copy link
Author

mlucool commented Dec 13, 2022

Happy to review if you do find some time to work on this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants