Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jwarc failing to read Http body at the second pass of a file #80

Closed
vlofgren opened this issue Dec 12, 2023 · 3 comments
Closed

jwarc failing to read Http body at the second pass of a file #80

vlofgren opened this issue Dec 12, 2023 · 3 comments
Labels

Comments

@vlofgren
Copy link

I'm running into an issue where the WarcReader behaves differently when reading the same file twice. It seems there is some global state that is not appropriately cleaned up perhaps. In the second pass, http().body().isOpen() returns false, when it returned true in the first pass; and attempting to read the data fails.

I'm on jwarc 0.28.4.

This is the smallest test case I've been able to produce:

    @Test
    public void spookyActionAtADistance() throws IOException {
        Path tempFile = Files.createTempFile("test", ".warc");
        try {
            StringBuilder sb = new StringBuilder();
            
            sb.append("WARC/1.0\r\n");
            sb.append("WARC-Type: revisit\r\n");
            sb.append("WARC-Target-URI: https://www.example.com/\r\n");
            sb.append("WARC-Date: 2023-12-12T20:31:13Z\r\n");
            sb.append("WARC-Record-ID: <urn:uuid:ee40237d-5fb4-41a7-aa2e-bfb67c57f5f0>\r\n");
            sb.append("Content-Length: 134\r\n");
            sb.append("Content-Type: application/http;msgtype=response\r\n");
            sb.append("WARC-Payload-Digest: sha1:KC62VKWCVTBH5CT636VIFBP3KTLP4UZM\r\n");
            sb.append("WARC-Block-Digest: sha1:T2C4UJZP7PHK75RWHCID3GDDKWX5J5UX\r\n");
            sb.append("\r\n");
            sb.append("HTTP/1.1 200 OK\r\n");
            sb.append("Content-Type: text/html\r\n");
            sb.append("Content-Length: 45\r\n");
            sb.append("Content-Encoding: UTF-8\r\n");
            sb.append("\r\n");
            sb.append("<?doctype html><html><body>test</body></html>");

            Files.writeString(tempFile, sb.toString());

            try (var reader = new WarcReader(tempFile)) {
                for (var record : reader) {
                    WarcRevisit revisit = (WarcRevisit) record;
                    var http = revisit.http();
                    try (var body = http.body()) {
                        System.out.println(body.isOpen()); // prints true
                        body.stream().readAllBytes(); // works fine
                    }
                }
            }

            try (var reader = new WarcReader(tempFile)) {
                for (var record : reader) {
                    WarcRevisit revisit = (WarcRevisit) record;
                    var http = revisit.http();
                    try (var body = http.body()) {
                        System.out.println(body.isOpen());  // prints false
                        body.stream().readAllBytes();  // exception !!
                    }
                }
            }

        }
        finally {
            Files.deleteIfExists(tempFile);
        }
    }
@ato ato closed this as completed in 35bae7b Dec 13, 2023
ato added a commit that referenced this issue Dec 13, 2023
Bugs fixed

* Fixed ClosedChannelException when reading a WarcRevisit body
  after closing a previous one due to reuse of empty MessageBody. #80
@ato ato added the bug label Dec 13, 2023
@ato
Copy link
Member

ato commented Dec 13, 2023

Oops. Thanks for reporting this! The same empty MessageBody instance was being reused which of course breaks if it gets closed.

There might be a secondary issue here in that jwarc currently assumes all revisit records have no payload but perhaps that's a false assumption.

@ato
Copy link
Member

ato commented Dec 13, 2023

Fix released as v0.28.5. It should sync to maven central in a couple of hours.

@vlofgren
Copy link
Author

Thanks for the quick fix :D

I'll work around the revisit limitations now that I know they exist.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants