Large files being split in output #144

Closed
petewritescode opened this issue Jun 30, 2023 · 3 comments · Fixed by #145
@petewritescode

Hi, this is a really useful library, so thank you!

I've hit an issue when concatenating large files, whereby the file contents get split up and mixed in with the content of other files.

Issue reproduction

I've created a codesandbox to reproduce the issue:

https://codesandbox.io/p/sandbox/clever-heisenberg-fkfxrt

The two input files in ./data are a.txt (containing thousands of lines of "a") and b.txt (containing a single "b"). There's a build script which runs globcat and outputs to ./dist/output.txt.

Expected outcome - I expect output.txt to contain all the content from a.txt followed by the content from b.txt.
Actual outcome - The content from b.txt is in the middle of the output, with thousands of "a"s before and after it.
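
For reference, the inputs can be recreated with a few lines of Node (a rough sketch; the exact line count isn't important, only that a.txt is big enough to span several stream chunks):

import fs from 'node:fs'

// Recreate the sandbox's ./data inputs: a.txt spans many 64 KiB chunks,
// b.txt is a single line.
fs.mkdirSync('data', { recursive: true })
fs.writeFileSync('data/a.txt', 'a\n'.repeat(100_000))
fs.writeFileSync('data/b.txt', 'b\n')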

Possible cause

I have limited experience with Node streams but I think the issue might lie here: https://github.com/smonn/globcat/blob/main/src/lib/files-to-stream.ts. Specifically this bit (shortened for clarity):

for (const file of files) {
  const stream = fs.createReadStream(file)
  passthrough = stream.pipe(passthrough, { end: false })
}

As I understand it, because the pipes are all set up inside the loop, the files stream into passthrough simultaneously, so their order is not guaranteed. I imagine this is fine as long as each file is below the highWaterMark (64 KiB by default) and is therefore processed as a single chunk.

However, files larger than this will be broken into chunks, and those chunks may be delayed and processed after chunks from a different file, making it possible for file content to be jumbled up as in my reproduction.
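
Here's a minimal standalone sketch (not the library code) that shows the same interleaving with two throwaway files, assuming the default highWaterMark:

import fs from 'node:fs'
import { PassThrough } from 'node:stream'

// Pipe two files into one PassThrough at the same time. big.txt is larger
// than the default 64 KiB highWaterMark, so it is read as many chunks, and
// small.txt's single chunk can land between them.
fs.writeFileSync('big.txt', 'a\n'.repeat(200_000)) // ~400 KiB, many chunks
fs.writeFileSync('small.txt', 'b\n')               // one tiny chunk

const passthrough = new PassThrough()
passthrough.pipe(process.stdout)

for (const file of ['big.txt', 'small.txt']) {
  fs.createReadStream(file).pipe(passthrough, { end: false })
}
// The "b" line typically shows up somewhere in the middle of the "a" lines.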

Possible fix

I've fixed this locally by writing each file's chunks to the passthrough sequentially. As I say, I'm no expert on Node streams, so I don't know whether there are any issues with this solution, but I've included it in case it's helpful:

for (const file of files) {
  const stats = await fsPromises.stat(file)
  if (stats.isFile()) {
    const stream = fs.createReadStream(file)
    // queueSize is tracked by the surrounding function; only end the
    // passthrough once the last file has finished streaming
    stream.once('end', () => {
      queueSize--
      if (queueSize === 0) {
        passthrough.end()
      }
    })
    // drain this file's chunks completely before moving on to the next file,
    // so chunks from different files can't interleave
    for await (const chunk of stream) {
      passthrough.write(chunk)
    }
  } else {
    throw new Error('Not a file: ' + file)
  }
}
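
For what it's worth, another way to get the same sequential behaviour (just a sketch, not a tested patch) is to wrap the loop in an async generator and hand it to Readable.from, which avoids tracking 'end' events by hand:

import fs from 'node:fs'
import { Readable } from 'node:stream'

// Sketch only: build one Readable that yields each file's chunks in order,
// fully draining a file before starting the next one.
function concatFiles(files) {
  return Readable.from((async function* () {
    for (const file of files) {
      const stats = await fs.promises.stat(file)
      if (!stats.isFile()) throw new Error('Not a file: ' + file)
      yield* fs.createReadStream(file) // chunks arrive in file order
    }
  })())
}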

Thanks again,

Pete

@Jason3S

Jason3S commented Nov 28, 2023

@smonn,

Any movement on this issue?

I got bit by this bug when upgrading from 2.0.1 to 3.0.1.

There is also an issue with the limit on the number of active streams.

Jason3S added a commit to streetsidesoftware/cspell that referenced this issue Nov 28, 2023
@smonn
Owner

smonn commented Dec 15, 2023

@Jason3S Thanks for this very detailed issue! I'll dive into it and get it sorted out.

smonn added a commit that referenced this issue Dec 15, 2023

fix: Large files being split in output #144 (plus version bumps, a coverage dependency, and CI fixes leading up to the 3.1.0 release)
smonn added the bug label Dec 15, 2023
@smonn
Owner

smonn commented Dec 15, 2023

Fixed in 3.1.0, let me know if it's still causing trouble. Thanks!
