Optimize array-from-ctypes in basic.py #3927

Merged
merged 1 commit into from
Feb 17, 2021

Commits on Feb 8, 2021

  1. Optimize array-from-ctypes in basic.py

    Approximately 80% of runtime when loading "low column count, high row
    count" DataFrames into Datasets is consumed in `np.fromiter`, called
    as part of the `Dataset.get_field` method.
    
    This is a particularly pernicious hotspot: unlike other ctypes-based
    methods, it is a hot loop over a Python iterator and causes
    significant GIL contention in multi-threaded applications.
    
    Replace `np.fromiter` with a direct call to `np.ctypeslib.as_array`,
    which allows a single-shot `copy` of the underlying array.
    
    This reduces the load time of a ~35 million row categorical dataframe
    with 1 column from ~5 seconds to ~1 second, and allows multi-threaded
    execution.
    asford committed Feb 8, 2021
    a24d3f8
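
The change described in the commit message can be illustrated with a minimal sketch. The helper names `array_via_fromiter` and `array_via_as_array` are hypothetical and are not taken from `basic.py`; the buffer here is a small `ctypes` float array standing in for the pointer that the library would return. The sketch only assumes the two NumPy calls named in the commit message, `np.fromiter` and `np.ctypeslib.as_array`.

```python
import ctypes

import numpy as np


def array_via_fromiter(ptr, length):
    """Element-by-element conversion: each item passes through a Python-level loop."""
    # Hypothetical stand-in for the old approach; the generator makes the
    # per-element Python iteration explicit.
    return np.fromiter((ptr[i] for i in range(length)), dtype=np.float32, count=length)


def array_via_as_array(ptr, length):
    """Single-shot conversion: wrap the C buffer as a view, then copy it once."""
    # as_array needs an explicit shape when given a bare pointer.
    view = np.ctypeslib.as_array(ptr, shape=(length,))
    # The copy makes the result independent of the lifetime of the C buffer.
    return view.copy()


# Build a small C float buffer and obtain a pointer to it (illustrative data only).
n = 1000
buf = (ctypes.c_float * n)(*range(n))
ptr = ctypes.cast(buf, ctypes.POINTER(ctypes.c_float))

slow = array_via_fromiter(ptr, n)
fast = array_via_as_array(ptr, n)
```

Both helpers produce the same `float32` array. The second avoids the per-element trip through the Python interpreter, which is the behavior the commit message identifies as the source of the runtime cost and GIL contention.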