Remove dependency on ZipFile and remove GC call from read #273

TimG1964 · 2024-10-04T16:53:32Z

Replacing ZipFile with ZipArchive has also removed the need for the GC call in read.

To make this work, I had to set the `enable_cache` kwarg to `true` rather than `false` for `readdata()` and `readtable()`.

The new tests, which `rm` the file after `readdata()` and `readtable()`, fail with `enable_cache=false` (as in the current master), so this PR sets the kwarg to `true` for both functions.

codecov · 2024-10-04T16:57:04Z

Codecov Report

Attention: Patch coverage is 92.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 95.06%. Comparing base (f4767c4) to head (7600fd5).
Report is 6 commits behind head on master.

Files with missing lines	Patch %	Lines
src/read.jl	90.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #273      +/-   ##
==========================================
+ Coverage   95.02%   95.06%   +0.04%     
==========================================
  Files          15       15              
  Lines        2009     1985      -24     
==========================================
- Hits         1909     1887      -22     
+ Misses        100       98       -2


TimG1964 · 2024-10-04T17:12:33Z

I'm not sure I understand the failed test on codecov. If I look at the details, 100% of my new code is covered by the tests. All the code that is not covered was there before this PR. Despite this, overall coverage has gone down and the test has failed!

nhz2

Some comments on how performance can be improved.

src/read.jl

nhz2 · 2024-10-09T14:10:44Z

                xf.files[filename] = true # set file as read

                try
-                    xf.data[filename] = EzXML.readxml(f)
+                    xf.data[filename] = EzXML.parsexml(ZipArchives.zip_readentry(xf.io, f, String))
                catch err
                    @error("Failed to parse internal XML file `$filename`")
                    rethrow()


The whole for loop and the file_not_found check can be simplified to:

    try
        xf.data[filename] = EzXML.parsexml(ZipArchives.zip_readentry(xf.io, filename))
    catch err
        @error("Failed to parse internal XML file `$filename`")
        rethrow()
    end

This works because zip_readentry takes care of looping through the entry names, reading a matching entry, and erroring if the filename isn't found.
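As a minimal sketch of the by-name lookup described above, the following builds a tiny archive and reads one entry from it, using only ZipArchives.jl. The entry names, the archive contents, and the helper `read_entry_or_rethrow` are hypothetical and are not part of XLSX.jl; the XML parsing step is left out to keep the example small.

```julia
using ZipArchives: ZipReader, ZipWriter, zip_newfile, zip_readentry

# Build a small throwaway archive (names and contents are made up).
path = tempname()
ZipWriter(path) do w
    zip_newfile(w, "xl/workbook.xml")
    write(w, "<workbook/>")
end

# ZipReader works on an in-memory byte vector, so no file handle stays open
# and the archive on disk can be removed straight away.
reader = ZipReader(read(path))
rm(path)

# Hypothetical helper: look the entry up by name, log and rethrow on failure.
function read_entry_or_rethrow(reader, filename)
    try
        return zip_readentry(reader, filename, String)
    catch
        @error "Failed to read internal file `$filename`"
        rethrow()
    end
end

xml = read_entry_or_rethrow(reader, "xl/workbook.xml")

# A name that is not in the archive makes zip_readentry throw,
# so no separate "file not found" check is needed.
missing_failed = try
    read_entry_or_rethrow(reader, "xl/missing.xml")
    false
catch
    true
end
```

Any bookkeeping that must only happen on success, such as marking a file as read, would go inside the `try` block after the read.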

TimG1964 · 2024-10-09

Thank you very much for all these suggestions. Where possible, I've accepted them here, as you can see.

Your last suggestion also seems to need the line

xf.files[filename] = true # set file as read

inside the try block; otherwise the file is recorded as not present. I've added this to my fork and will figure out how to include it here.

src/stream.jl

This should be faster because it avoids creating a String, and there is a check of the uncompressed_size in https://github.com/JuliaIO/ZipArchives.jl/blob/f955785e237a0a8b3607cf651eaebc1eb1037b8c/src/reader.jl#L344
Co-authored-by: Nathan Zimmerberg

zip_openentry can be used here to avoid decompressing the entire entry into memory. Also, the error on the line after this can be removed with this change.
Co-authored-by: Nathan Zimmerberg

This should be faster because it avoids allocating all of the entry names at once.
Co-authored-by: Nathan Zimmerberg
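The last two suggestions can be sketched roughly as follows, assuming a small throwaway archive. The entry name, its contents, and the helper `find_entry` are hypothetical; only the ZipArchives.jl calls are real. The loop looks at one entry name at a time instead of collecting all of them, and `zip_openentry` hands back a stream so the entry is read incrementally rather than decompressed into memory in one go.

```julia
using ZipArchives: ZipReader, ZipWriter, zip_newfile, zip_nentries, zip_name, zip_openentry

# Build a small throwaway archive (name and contents are made up).
path = tempname()
ZipWriter(path) do w
    zip_newfile(w, "xl/worksheets/sheet1.xml")
    write(w, "<row r=\"1\"/>\n<row r=\"2\"/>\n")
end
reader = ZipReader(read(path))
rm(path)

# Hypothetical helper: scan entries by index, comparing one name at a time,
# instead of building the full vector of names first.
function find_entry(reader, wanted)
    for i in 1:zip_nentries(reader)
        zip_name(reader, i) == wanted && return i
    end
    return nothing
end

idx = find_entry(reader, "xl/worksheets/sheet1.xml")

# Open the entry as a stream and read only the first line from it.
first_line = zip_openentry(reader, idx) do io
    readline(io)
end
```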

TimG1964 · 2024-10-09T15:17:18Z

One reflection here is that my proposed changes have removed the mmap option to read in place. I don't know enough to be sure, but I guess this would be a breaking change for some use cases (large files). It wasn't a deliberate omission!

I am working on adding it back in but, unfortunately, it has caused the original problem to reappear, namely that the file cannot be written after it has been read:

SystemError: opening file "output_tables.xlsx": Invalid argument

I've spent some time on this now but - for me at least - it seems pretty intractable! :-(

nhz2 · 2024-10-09T16:42:35Z

The segfaults are very strange; it seems like there is a bug in EzXML.jl.

I wonder if it would be possible to switch to https://github.com/JuliaComputing/XML.jl. This would help fix the multi-threading issues as well.

TimG1964 · 2024-10-09T17:02:05Z

Gulp!
Day job permitting I might have a go at this but, so far, the more I dig the more of a pickle I seem to get into!
Don't hold your breath!

TimG1964 · 2024-10-11T15:20:37Z

Just found this in the HDF5.jl docs:

Note: if you use readmmap on a dataset and subsequently close the file, the array data are still available---and file continues to be in use---until all of the arrays are garbage-collected.

Might this be something that is going on here?
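The behaviour in that note can be sketched with Julia's standard `Mmap` library rather than HDF5.jl. Whether this is what actually causes the error in this PR is an assumption and has not been confirmed; the file name and contents below are made up.

```julia
using Mmap

# Write a small throwaway file.
path = tempname()
write(path, "hello")

# Map the file, then let the do-block close the stream.
bytes = open(path) do io
    Mmap.mmap(io, Vector{UInt8}, 5)
end

# The mapped array is still readable after the stream is closed,
# which means the mapping itself is still alive.
text = String(copy(bytes))

# Drop the only reference and collect, so the mapping can be released
# before the file is removed. On some platforms (notably Windows) the
# file may otherwise remain in use until the array is garbage-collected.
bytes = nothing
GC.gc()
rm(path)
```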

TimG1964 added 14 commits October 3, 2024 23:27

Remove the gc call in read.jl

8063a8a

To make this work, I had to set the `enable_cache` kwarg to `true` rather than false for `readdata()` and `readtable()`

Added tests for rm after readdata and readtable

7ccb887

These tests will fail with `enable_cache=false` for both `readdata()` and `readtable()` (as in the current master). This PR changes this kwarg for these functions to `true`.

Remove remaining dependence on ZipFiles

3dc95bc

Remove remaining dependency on ZipFiles

fc3f8b1

Remove any remaining dependence on ZipFile

281d5c1

Remove last trace of ZipFile

7343cab

Remove dependency on ZipFile

e0979b3

Update types.jl

b189ea1

Remove ZipFile

d59aa69

Remove ZipFile and gc call on read

55b570b

Remove ZipFile

14a30c6

Remove ZipFile and gc call on read

f54349a

Remove ZipFile

acc6efb

Remove ZipFile

d81e672

nhz2 reviewed Oct 9, 2024


TimG1964 and others added 3 commits October 9, 2024 15:53

Following suggestion from @nhz2

ffdc969

zip_openentry can be used here to avoid decompressing the entire entry into memory. Also, the error on the line after this can be removed with this change.
Co-authored-by: Nathan Zimmerberg

Following suggestion from @nhz2

37eb14f

This should be faster because it avoids allocating all of the entry names at once.
Co-authored-by: Nathan Zimmerberg

nhz2 mentioned this pull request Oct 10, 2024

Using StreamReader with TranscodingStreams leads to random Segmentation faults JuliaIO/EzXML.jl#200

Open

Following suggestion by @nhz2

7600fd5

TimG1964 closed this Oct 11, 2024
