Faster xdr reader #32

trossi · 2023-12-05T12:10:51Z

This PR adds a faster reader for files in xdr format. Full arrays are read directly with numpy instead of element by element. As a positive side effect, the deprecated xdrlib module is no longer needed.
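To illustrate the idea, here is a minimal sketch (not the code in this PR) of the two ways to decode XDR doubles, which are stored as big-endian 8-byte floats. The function names are hypothetical and only serve the example; `struct.unpack_from` and `np.frombuffer` are the standard calls.

```python
import struct

import numpy as np


def read_doubles_elementwise(buf, n):
    """Decode n big-endian float64 values one at a time (slow path)."""
    return [struct.unpack_from(">d", buf, 8 * i)[0] for i in range(n)]


def read_doubles_numpy(buf, n):
    """Decode n big-endian float64 values in a single numpy call (fast path)."""
    # ">f8" is a big-endian 8-byte float, matching the XDR double encoding.
    # np.frombuffer creates a read-only view on the buffer without copying it.
    return np.frombuffer(buf, dtype=">f8", count=n)


# Small self-check: both paths must agree on the same payload.
values = [0.0, 1.5, -2.25, 3.0]
payload = struct.pack(">4d", *values)
slow = read_doubles_elementwise(payload, 4)
fast = read_doubles_numpy(memoryview(payload), 4)
assert slow == fast.tolist()
```

The numpy path makes one call per array instead of one call per element, which is presumably where most of the speed-up comes from for large arrays.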

Related to #31. I have cherry-picked and rebased the commits related to xdr reader improvements onto this PR. There are also some structural changes that are open for discussion; for example, the xdr reader is moved to rdata/io/xdr.py to simplify the separation between different readers, such as the (upcoming) rdata/io/ascii.py. @vnmabus Could you review?

codecov-commenter · 2023-12-05T12:12:47Z

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (d4c1d83) 92.02% compared to head (3a4631e) 91.98%.
Report is 11 commits behind head on develop.

Files	Patch %	Lines
rdata/parser/_parser.py	96.15%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop      #32      +/-   ##
===========================================
- Coverage    92.02%   91.98%   -0.05%     
===========================================
  Files            6        7       +1     
  Lines         1104     1086      -18     
===========================================
- Hits          1016      999      -17     
+ Misses          88       87       -1

trossi · 2023-12-08T09:26:08Z

Here is approximate timing data for reference:

Array size (MiB)	Time to read before this PR (s)	Time to read with this PR (s)
16	1.2	0.3
32	2.2	0.3
64	4.2	0.3
128	8.0	0.4
256	18.5	0.5
512	39.4	0.8
1024	77.3	1.5

This data was created with the following (in bash):

Generate test data (without compression to skip time spent in decompression): for i in {1..7}; do n=$(( 2 ** $i )); Rscript -e "saveRDS(runif(n=$n*1024**2), file='array_$i.rds', compress=FALSE)"; done
Read and measure time: for i in {1..7}; do echo $i; time -p python -c "from rdata.parser import parse_file; parse_file('array_$i.rds')"; done
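For anyone who prefers to measure from inside Python, the read loop above could be sketched roughly as follows. The helper names `time_reads` and `benchmark_rdata` are hypothetical; only `rdata.parser.parse_file` and the `array_$i.rds` file names come from the commands above. Unlike `time -p python -c ...`, this excludes interpreter start-up and import time, so the numbers will not match the table exactly.

```python
import time


def time_reads(read, paths):
    """Call read(path) for each path and return {path: elapsed seconds}."""
    timings = {}
    for path in paths:
        start = time.perf_counter()
        read(path)
        timings[path] = time.perf_counter() - start
    return timings


def benchmark_rdata():
    """Time parse_file on array_1.rds ... array_7.rds (files must exist)."""
    # Imported lazily so that time_reads() can be used without rdata installed.
    from rdata.parser import parse_file

    paths = [f"array_{i}.rds" for i in range(1, 8)]
    for path, seconds in time_reads(parse_file, paths).items():
        print(f"{path}: {seconds:.2f} s")
```

Calling `benchmark_rdata()` in the directory containing the generated files prints one timing per file.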

vnmabus

Sorry for the delay in accepting this. I was on vacation and had a "forced" digital detox. Approving and merging now.

trossi · 2024-01-16T10:37:39Z

> Sorry for the delay in accepting this. I was on vacation and had a "forced" digital detox. Approving and merging now.

No problem, thank you for merging! I hope you had a relaxing vacation!

I'll open a PR for the ASCII reader next.

vnmabus · 2024-01-25T14:06:35Z

> Here is approximate timing data for reference:

Just to let you know: I added your example (but limited to 5 iterations) as an asv benchmark to the package, to check for future performance regressions. I also added a new testing module (currently undocumented) to retrieve and execute R snippets from strings, so that each test can have its own associated R snippet for creating the data, instead of one big script for all of them.

trossi added 12 commits November 29, 2023 13:53

Add copyright notice

24eac25

Add faster readers for double and complex arrays

792df94

Add faster reader for int arrays

26be6ef

Do not use xdrlib (to be removed in Python 3.13)

79d9484

Fix complex dtype

b3c36a6

Use file-like object to simplify reading

285f707

Fix mypy and flake8 issues

330a74b

Move xdr parser to a separate file

9b4e578

Avoid unnecessary copy

b44b7d5

Add authors

6d5b419

Use dtype object instead of itemkind and itemsize

074a825

Clarify array value reading

afb0d0e

trossi commented Dec 5, 2023

vnmabus requested changes Dec 12, 2023

trossi and others added 7 commits December 18, 2023 14:20

Move xdr parser to parser directory

ab24a7e

Remove duplicate definition of R_INT_NA

5104fa4

Add parentheses for clarity

b48d0cc

Pass memoryview to xdr parser

d2d318c

Declare _parse_array_values() as abstract

10ba6de

Clean up imports and docstring

3a4631e

Update LICENSE

ad85c61

Change copyright notice to make it clear that the copyright is shared between contributors.

vnmabus approved these changes Jan 16, 2024

vnmabus merged commit f81ed84 into vnmabus:develop Jan 16, 2024
14 of 15 checks passed

trossi mentioned this pull request Sep 4, 2024

Add JOSS paper. #43

Open
