Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wc: streaming --files0-from and other improvements #4696

Merged
merged 6 commits into from
May 27, 2023
Merged

Conversation

jeddenlea
Copy link
Contributor

@jeddenlea jeddenlea commented Mar 31, 2023

My original focus was on --files0-from, which should be handled as a stream if the file specified (or stdin) is not a regular file. I've accomplished this by making a new Inputs type to represent the desired list of files to process. Its try_iter method does most of the heavy lifting to figure out how to actually run.

Secondarily, I have attempted to reduce the number of String allocations that occur while printing the results. Now, unless a file name needs escaping, or an error occurs, no additional allocations should be necessary to print results. An Input now borrows file names from the command line, unless they're read from --files0-from. print_stats was the biggest abuser, allocating a String for each column before before being join'd into a String for the whole line. The TitledWordCount type is not necessary at all, which was another source of String allocation.

Finally, I've made some effort to make more cases match GNU wc's output. Errors encountered processing --files0-from as well as any encountered processing files listed on the command line will be escaped more consistently like GNU wc. File names printed to the right of their stats will be less aggressively quoted, to match GNU wc, now only if they are not UTF-8 or if they contain a newline.

@github-actions
Copy link

github-actions bot commented Mar 31, 2023

GNU testsuite comparison:

GNU test failed: tests/misc/wc-files0. tests/misc/wc-files0 is passing on 'main'. Maybe you have to rebase?

@cakebaker
Copy link
Contributor

Can you please run cargo clippy? Thanks!

@jeddenlea
Copy link
Contributor Author

Yes. Apologies, I thought I had!

@github-actions
Copy link

GNU testsuite comparison:

GNU test failed: tests/misc/wc-files0. tests/misc/wc-files0 is passing on 'main'. Maybe you have to rebase?

@sylvestre
Copy link
Contributor

could you please split the PR into smaller commits ? This is a big patch ...

@github-actions
Copy link

github-actions bot commented Apr 1, 2023

GNU testsuite comparison:

GNU test failed: tests/misc/wc-files0. tests/misc/wc-files0 is passing on 'main'. Maybe you have to rebase?
GNU test failed: tests/tail-2/inotify-dir-recreate. tests/tail-2/inotify-dir-recreate is passing on 'main'. Maybe you have to rebase?

@jeddenlea
Copy link
Contributor Author

jeddenlea commented Apr 1, 2023

I've broken up the change into 4 different commits. The lines between some concerns were a little blurry, as I ended up touching a lot of stuff, but I've tried my best to keep them each cleanly readable on their own.

I'm still working on getting the GNU coreutils tests running completely, but I've at least got the one complaining to run. And, it turns out wc will read the entirety of a --files0-from file before processing so long as it is no more than 10 MiB, so I need to do a little refactoring on the last commit...

@jeddenlea jeddenlea marked this pull request as draft April 1, 2023 17:49
@jeddenlea jeddenlea force-pushed the wc branch 2 times, most recently from 7c5c68c to 9f6b788 Compare April 3, 2023 06:00
@github-actions
Copy link

github-actions bot commented Apr 3, 2023

GNU testsuite comparison:

GNU test failed: tests/misc/wc-parallel. tests/misc/wc-parallel is passing on 'main'. Maybe you have to rebase?
GNU test failed: tests/tail-2/inotify-dir-recreate. tests/tail-2/inotify-dir-recreate is passing on 'main'. Maybe you have to rebase?

@sylvestre
Copy link
Contributor

please remove the "draft" when ready to be reviewed :)

@jeddenlea jeddenlea marked this pull request as ready for review April 4, 2023 06:05
@jeddenlea
Copy link
Contributor Author

Alright, I think I'm ready for some feedback to iterate on here. On one point in particular, I didn't realize that when I introduced use of AsRawFd/FromRawFd that the import paths were new as of 1.66. Also, the handy ilog10 method for determining how many digits an int will need for printing was only added in 1.67! I see the current minimum is 1.64, which is pretty recent compared to some other repos' minimum version. What's your policy on updating the minimum version of Rust?

@tertsdiepraam
Copy link
Member

What's your policy on updating the minimum version of Rust?

We do it only after cutting a release and check how important the new features are. Essentially we bump it by need. But always in a separate PR.

Copy link
Member

@tertsdiepraam tertsdiepraam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty cool stuff! I would suggest that you keep patches much smaller in the future though.

src/uu/wc/src/wc.rs Outdated Show resolved Hide resolved
src/uu/wc/src/wc.rs Outdated Show resolved Hide resolved
src/uu/wc/src/wc.rs Outdated Show resolved Hide resolved
src/uu/wc/src/wc.rs Outdated Show resolved Hide resolved
src/uu/wc/src/wc.rs Outdated Show resolved Hide resolved
src/uu/wc/src/wc.rs Show resolved Hide resolved
show_bytes: bool,
show_chars: bool,
show_lines: bool,
show_words: bool,
show_max_line_length: bool,
files0_from_path: Option<PathBuf>,
files0_from: Option<Input<'a>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you make things more complicated for performance reasons (which I assume the lifetimes are for), you do kind of need to show that performance improves. I bet it doesn't really matter, because it happens just once and this is not a hot path. The same applies to some of the Cow strings. If you just get rid of cloning and do not introduce complexity it's always fine of course.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get perhaps irrationally obsessive about avoiding unnecessary allocations.... But, I was actually pleased with how simple this whole scheme ended up being. In my mind, 'a was just short for "command-line Args", which is what we're aiming to borrow from here... I almost came up with more complicated schemes with other lifetimes specified, but they weren't really helpful.

src/uu/wc/src/wc.rs Outdated Show resolved Hide resolved
src/uu/wc/src/wc.rs Outdated Show resolved Hide resolved
Copy link
Contributor Author

@jeddenlea jeddenlea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback! I've done my best to address your concerns within each of the 4 commits, and I've rebased them all atop main.

src/uu/wc/src/wc.rs Outdated Show resolved Hide resolved
src/uu/wc/src/wc.rs Outdated Show resolved Hide resolved
src/uu/wc/src/wc.rs Outdated Show resolved Hide resolved
src/uu/wc/src/wc.rs Show resolved Hide resolved
src/uu/wc/src/wc.rs Outdated Show resolved Hide resolved
src/uu/wc/src/wc.rs Outdated Show resolved Hide resolved
src/uu/wc/src/wc.rs Outdated Show resolved Hide resolved
show_bytes: bool,
show_chars: bool,
show_lines: bool,
show_words: bool,
show_max_line_length: bool,
files0_from_path: Option<PathBuf>,
files0_from: Option<Input<'a>>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get perhaps irrationally obsessive about avoiding unnecessary allocations.... But, I was actually pleased with how simple this whole scheme ended up being. In my mind, 'a was just short for "command-line Args", which is what we're aiming to borrow from here... I almost came up with more complicated schemes with other lifetimes specified, but they weren't really helpful.

@sylvestre
Copy link
Contributor

See this failure:


     Running `sccache rustc --crate-name uu_relpath --edition=2021 src/uu/relpath/src/relpath.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C embed-bitcode=no -C debuginfo=2 -C metadata=9e8dee9d6aa6280e -C extra-filename=-9e8dee9d6aa6280e --out-dir /home/runner/work/coreutils/coreutils/target/debug/deps -L dependency=/home/runner/work/coreutils/coreutils/target/debug/deps --extern clap=/home/runner/work/coreutils/coreutils/target/debug/deps/libclap-e00c16ba6075349e.rmeta --extern uucore=/home/runner/work/coreutils/coreutils/target/debug/deps/libuucore-1bf9a7e8458cc252.rmeta -Awarnings -L native=/home/runner/work/coreutils/coreutils/target/debug/build/blake3-ca32ffe4cb3ec7ed/out -L native=/home/runner/work/coreutils/coreutils/target/debug/build/blake3-ca32ffe4cb3ec7ed/out`
error[E0599]: no method named `ilog10` found for type `u64` in the current scope
   --> src/uu/wc/src/wc.rs:719:46
    |
719 |                 let total_width = (1 + total.ilog10())
    |                                              ^^^^^^ help: there is an associated function with a similar name: `log`

For more information about this error, try `rustc --explain E0599`.
error: could not compile `uu_wc` due to previous error

Caused by:

@jeddenlea
Copy link
Contributor Author

Hey Sylvestere. Yeah, that's what prompted my question about your policy around minimum versions.... If you anticipate moving to 1.67 soon, I'd leave that in place. Otherwise, we'll need to do something different. It's unfortunate Rust hasn't stabilized its version check, https://doc.rust-lang.org/beta/unstable-book/language-features/cfg-version.html

src/uu/wc/src/wc.rs Show resolved Hide resolved
src/uu/wc/src/wc.rs Outdated Show resolved Hide resolved
let input = match maybe_input {
Ok(input) => input,
Err(err) => {
record_error!("{err}");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
record_error!("{err}");
show!(e);

(This might need an into)

So instead of io::Error -> String -> USimpleError, we do io::Error -> UIoError.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's slick! So now not only do we get to punt on the naming of show, but we still get to make each case of it into a single line which was the whole point of record_error to begin with.

src/uu/wc/src/wc.rs Outdated Show resolved Hide resolved
src/uu/wc/src/wc.rs Outdated Show resolved Hide resolved
src/uu/wc/src/wc.rs Show resolved Hide resolved
src/uu/wc/src/wc.rs Show resolved Hide resolved
Comment on lines +177 to +180
fn try_iter(
&'a self,
settings: &'a Settings<'a>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The structure of functions here is what gets you into trouble with a lot of references and the Cow in Input. What if you make Input always a reference and then structure it like this:

fn count_inputs(inputs: Inputs) -> ... {
    match inputs {
        Inputs::Stdin => count_input(Input::Stdin(StdinKind::Implicit)),
        Inputs::Paths(paths) => /* call count_input for each path */,
        Self::Files0From(path) => /* call count_input for each line */
}

Then you don't have to pass Iterators around which makes lifetimes a lot easier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The iterator returned by try_iter takes care of so much, though. It also helps keep the flow of the program consistent with what was happening before. It used to be that it would only iterate through a [Input], then we couldn't handle streams. I wanted to be able to handle streams from --files0-from in the same manner as a list of command line arguments, and it has also made it easy to consistently handle errors like empty file names.

I should have mentioned already, and I'm embarrassed I still haven't gotten to it yet, but it was my intent to write some more tests to capture the new cases in which this change makes this wc act like GNU's. The formatting of the errors when empty paths are found in either a --files0-from or the command line is one "feature" we get with this.

@tertsdiepraam
Copy link
Member

If you anticipate moving to 1.67 soon

I think not just for this.

It's unfortunate Rust hasn't stabilized its version check

Honestly, I'm glad 😄 At least for this project, keeping a consistent MSRV would always be the best option. For libraries, it might be different of course.

@jeddenlea
Copy link
Contributor Author

If you anticipate moving to 1.67 soon

I think not just for this.

Oh sure, this is not a killer feature worth it on its own. I'll see about copying it in verbatim if it's short, or writing a facsimile...

It's unfortunate Rust hasn't stabilized its version check

Honestly, I'm glad 😄 At least for this project, keeping a consistent MSRV would always be the best option. For libraries, it might be different of course.

Those jerks just keep adding useful stuff to the standard library! This is certainly not the first time I stumbled onto something useful that just happened to be very new and caused this problem.

@sylvestre
Copy link
Contributor

With this PR: #4460 i think it needs a rebase :)
sorry

@sylvestre
Copy link
Contributor

@jeddenlea on android:

 error: pathspec './tests/fixtures/wc/alice' did not match any file(s) known to git
error: pathspec 'in' did not match any file(s) known to git

rings a bell ?

@jeddenlea
Copy link
Contributor Author

@jeddenlea on android:

 error: pathspec './tests/fixtures/wc/alice' did not match any file(s) known to git
error: pathspec 'in' did not match any file(s) known to git

rings a bell ?

I had made a symlink named alice in wonderland.txt that pointed to alice_in_wonderland.txt, hoping to emphasize that their contents were the same. But, that Android script is not friendly to symlinks with whitespace in their name. So, to keep things simple, I'll just make alice in wonderland.txt a file too.

tests/by-util/test_wc.rs Outdated Show resolved Hide resolved
@github-actions
Copy link

GNU testsuite comparison:

GNU test failed: tests/tail-2/inotify-dir-recreate. tests/tail-2/inotify-dir-recreate is passing on 'main'. Maybe you have to rebase?

@github-actions
Copy link

GNU testsuite comparison:

GNU test failed: tests/tail-2/inotify-dir-recreate. tests/tail-2/inotify-dir-recreate is passing on 'main'. Maybe you have to rebase?

@jeddenlea
Copy link
Contributor Author

The GnuTest failures look like a hiccup with apt. Hopefully the Windows tests were also just bad luck...

@sylvestre
Copy link
Contributor

nope, I think the test doesn't work on windows. see:

  TRY 1 SLOW [>4800.000s] coreutils::tests test_wc::test_files0_progressive_stream
   Canceling due to interrupt: 1 tests still running
       ABORT [4856.604s] coreutils::tests test_wc::test_files0_progressive_stream
     Message [         ] code 0xc000013a: {Application Exit by CTRL+C}
The application terminated as a result of a CTRL+C. (os error 572)
------------
     Summary [5131.687s] 2147 tests run: 2140 passed (1 slow, 4 leaky), 7 failed, 24 skipped
error: test run failed
Error: The operation was canceled.

it times out

@jeddenlea
Copy link
Contributor Author

nope, I think the test doesn't work on windows. see:

  TRY 1 SLOW [>4800.000s] coreutils::tests test_wc::test_files0_progressive_stream
   Canceling due to interrupt: 1 tests still running
       ABORT [4856.604s] coreutils::tests test_wc::test_files0_progressive_stream
     Message [         ] code 0xc000013a: {Application Exit by CTRL+C}
The application terminated as a result of a CTRL+C. (os error 572)
------------
     Summary [5131.687s] 2147 tests run: 2140 passed (1 slow, 4 leaky), 7 failed, 24 skipped
error: test run failed
Error: The operation was canceled.

it times out

🤦 I guess I could have looked myself! Sorry about that. I think I understand what the failure is, and it shouldn't be too hard to fix in theory. I'm currently in the middle of getting VirtualBox running so I can actually work on the thing in Windows....

jeddenlea added 5 commits May 20, 2023 22:30
The Settings object did not need a QuotingStyle member, it was basically
a static.
print_stats will now take advantage of the buffer built into
io::stdout().

We can also waste fewer lines on show! by making a helper macro.
WcError should not hold Strings which are clones of `&'static str`s.
thiserror provides a convenient API to declare errors with their Display
messages.

wc quotes the --files0-from source slightly differently when reporting
errors about empty file names.
Sadly ilog10 isn't available until we use 1.67, but we can get close in
the meantime.
My previous commits meant to bring our wc's output and behavior in line
with GNU's. There should be tests that check for these changes!

I found a stupid bug in my own changes, I was not adding 1 to the
indexes produced by .enumerate() when printing errors.
@github-actions
Copy link

GNU testsuite comparison:

Congrats! The gnu test tests/tail-2/inotify-dir-recreate is no longer failing!

@jeddenlea
Copy link
Contributor Author

Woo! Looks like I fixed the Windows stuff. That was a.... pain. VirtualBox didn't work as it is wont to do. I ended up using a free t2.micro on AWS, probably the first time I've used Windows since I almost threw a Windows 7 laptop out of my window because it was so bloody slow and awful!

Anyway, in the interest of preventing any individual commits entering the tree which fail CI on their own, I patched the last two here and rebased them all again. You can see the diffs here: https://github.com/jeddenlea/coreutils/compare/old_wc..wc

@sylvestre
Copy link
Contributor

impressive, well done :)

@sylvestre sylvestre merged commit 388fa1b into uutils:main May 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants