-
-
Notifications
You must be signed in to change notification settings - Fork 689
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Cython Support to Arrow #1143
base: master
Are you sure you want to change the base?
Conversation
Co-Authored by: Lukas Lemke <[email protected]>
Co-Authored by: Lukas Lemke <[email protected]>
Co-Authored by: Lukas Lemke <[email protected]> - since cython doesn't support changing builtins at runtime
Co-Authored by: Lukas Lemke <[email protected]>
Co-Authored by: Lukas Lemke <[email protected]>
Co-Authored by: Lukas Lemke <[email protected]>
Very nice, i had a wip branch of this, but this looks like good progress. |
Co-Authored by: Lukas Lemke <[email protected]>
Co-Authored by: Lukas Lemke <[email protected]> - automatically compile cython when running make build3x - add target to clean .c and binary files produced from compiling cython
This reverts commit 64fdcec.
We realized the system tests are failing because the test systems aren't installing Cython before running setup.py. It seems that the continuous_integration.yml is using cached requirements and therefore not installing Cython even though we specified it in requirements.txt? We are not exactly sure and could use some help with this. Cython seems to install fine when we do "make build3x" locally. |
tox handles setup.py dependencies a bit differently, which we install during the GH Actions. We should probably extract those out of the workflow and in a separate requirements file. |
Codecov ReportBase: 100.00% // Head: 99.59% // Decreases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## master #1143 +/- ##
===========================================
- Coverage 100.00% 99.59% -0.41%
===========================================
Files 9 9
Lines 2319 2684 +365
Branches 492 0 -492
===========================================
+ Hits 2319 2673 +354
- Misses 0 11 +11
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
@13MK3 I see 3.8+ is passing, the rest are failing. Great work. @anishnya @jadchaar @systemcatch how do we want to handle this going forward? do we want to go full cython or offer a pure python version as well? |
Pull Request Checklist
Thank you for taking the time to improve Arrow! Before submitting your pull request, please check all appropriate boxes:
tox
ormake test
to find out!).tox -e lint
ormake lint
to find out!).master
branch.If you have any questions about your code changes or any of the points above, please submit your questions along with the pull request and we will try our best to help!
Description of Changes
These changes were made in collaboration with @khanm3
This pull request adds Cython support to Arrow as requested by #617.
Further Explanation
In arrow/arrow.py, we had to change the variable "locale" to "locale_cls" because the variable was being reused for 2 different types and this functionality was incompatible with Cython. Cython also had difficulties recognizing the type of _datetime.isoclanedar, so we had to specifically cast it to a tuple. Without this, it would not compile.
Even with the issues mentioned above, we chose to use Cython 3.0.0a11 (which is a pre-release version) instead of Cython 0.29.32 (the most recent stable version), since from our testing, Cython 3 is better at parsing modern Python features. In general, Cython tries to subsume the functionality of python, but as you can see there are still many quirks.
We were able to allow Cython to use the existing method of calculating code coverage by adding linetracing as a compiler directive. However, due to the quirks of Cython, the resulting line coverage results are slightly lower than 100%. We found that the lines that were no longer being covered were one of 2 cases:
This could be ignored by adding more 'pragma no cover', but we're not sure if there is a better way.
Performance Analysis
To compare performance of Arrow with and without Cython, we ran just the Arrow part of this benchmark. This benchmark runs 1 million executions per benchmark and picked the min of 5 repeats. This was run in Python 3.8.10.
From these results, you can see that compiling Arrow with Cython does cause a measureable improvement to performance, especially with parsing. Additionally, compiling Cython without linetracing is even more performant, but we currently compile with it so that code coverage can be calculated.
Future Work
Adding static typing (and necessarily converting those files to .pyx files) would bring an even bigger boost to performance. Parser.py should receive special attention, as this is where the largest amount of time is being spent. This is where adding Cython to Arrow could prove to be especially valuable, since the speed increase with static typing should be much more than that gained by simply compiling with Cython. Looking into different ways of calculating code coverage so linetracing is no longer needed, or providing an option to switch the compiler directive would also be a good way to improve performance in the future.