-
-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
wc: streaming --files0-from and other improvements #4696
Conversation
GNU testsuite comparison:
|
Can you please run |
Yes. Apologies, I thought I had! |
GNU testsuite comparison:
|
could you please split the PR into smaller commits ? This is a big patch ... |
GNU testsuite comparison:
|
I've broken up the change into 4 different commits. The lines between some concerns were a little blurry, as I ended up touching a lot of stuff, but I've tried my best to keep them each cleanly readable on their own. I'm still working on getting the GNU coreutils tests running completely, but I've at least got the one complaining to run. And, it turns out |
7c5c68c
to
9f6b788
Compare
GNU testsuite comparison:
|
please remove the "draft" when ready to be reviewed :) |
Alright, I think I'm ready for some feedback to iterate on here. On one point in particular, I didn't realize that when I introduced use of AsRawFd/FromRawFd that the import paths were new as of 1.66. Also, the handy |
We do it only after cutting a release and check how important the new features are. Essentially we bump it by need. But always in a separate PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty cool stuff! I would suggest that you keep patches much smaller in the future though.
show_bytes: bool, | ||
show_chars: bool, | ||
show_lines: bool, | ||
show_words: bool, | ||
show_max_line_length: bool, | ||
files0_from_path: Option<PathBuf>, | ||
files0_from: Option<Input<'a>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you make things more complicated for performance reasons (which I assume the lifetimes are for), you do kind of need to show that performance improves. I bet it doesn't really matter, because it happens just once and this is not a hot path. The same applies to some of the Cow
strings. If you just get rid of cloning and do not introduce complexity it's always fine of course.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get perhaps irrationally obsessive about avoiding unnecessary allocations.... But, I was actually pleased with how simple this whole scheme ended up being. In my mind, 'a
was just short for "command-line Args", which is what we're aiming to borrow from here... I almost came up with more complicated schemes with other lifetimes specified, but they weren't really helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback! I've done my best to address your concerns within each of the 4 commits, and I've rebased them all atop main.
show_bytes: bool, | ||
show_chars: bool, | ||
show_lines: bool, | ||
show_words: bool, | ||
show_max_line_length: bool, | ||
files0_from_path: Option<PathBuf>, | ||
files0_from: Option<Input<'a>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get perhaps irrationally obsessive about avoiding unnecessary allocations.... But, I was actually pleased with how simple this whole scheme ended up being. In my mind, 'a
was just short for "command-line Args", which is what we're aiming to borrow from here... I almost came up with more complicated schemes with other lifetimes specified, but they weren't really helpful.
See this failure:
|
Hey Sylvestere. Yeah, that's what prompted my question about your policy around minimum versions.... If you anticipate moving to 1.67 soon, I'd leave that in place. Otherwise, we'll need to do something different. It's unfortunate Rust hasn't stabilized its version check, https://doc.rust-lang.org/beta/unstable-book/language-features/cfg-version.html |
src/uu/wc/src/wc.rs
Outdated
let input = match maybe_input { | ||
Ok(input) => input, | ||
Err(err) => { | ||
record_error!("{err}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
record_error!("{err}"); | |
show!(e); |
(This might need an into
)
So instead of io::Error -> String -> USimpleError
, we do io::Error -> UIoError
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's slick! So now not only do we get to punt on the naming of show, but we still get to make each case of it into a single line which was the whole point of record_error
to begin with.
fn try_iter( | ||
&'a self, | ||
settings: &'a Settings<'a>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The structure of functions here is what gets you into trouble with a lot of references and the Cow
in Input
. What if you make Input
always a reference and then structure it like this:
fn count_inputs(inputs: Inputs) -> ... {
match inputs {
Inputs::Stdin => count_input(Input::Stdin(StdinKind::Implicit)),
Inputs::Paths(paths) => /* call count_input for each path */,
Self::Files0From(path) => /* call count_input for each line */
}
Then you don't have to pass Iterators
around which makes lifetimes a lot easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The iterator returned by try_iter
takes care of so much, though. It also helps keep the flow of the program consistent with what was happening before. It used to be that it would only iterate through a [Input]
, then we couldn't handle streams. I wanted to be able to handle streams from --files0-from
in the same manner as a list of command line arguments, and it has also made it easy to consistently handle errors like empty file names.
I should have mentioned already, and I'm embarrassed I still haven't gotten to it yet, but it was my intent to write some more tests to capture the new cases in which this change makes this wc
act like GNU's. The formatting of the errors when empty paths are found in either a --files0-from
or the command line is one "feature" we get with this.
I think not just for this.
Honestly, I'm glad 😄 At least for this project, keeping a consistent MSRV would always be the best option. For libraries, it might be different of course. |
Oh sure, this is not a killer feature worth it on its own. I'll see about copying it in verbatim if it's short, or writing a facsimile...
Those jerks just keep adding useful stuff to the standard library! This is certainly not the first time I stumbled onto something useful that just happened to be very new and caused this problem. |
With this PR: #4460 i think it needs a rebase :) |
@jeddenlea on android:
rings a bell ? |
I had made a symlink named |
GNU testsuite comparison:
|
GNU testsuite comparison:
|
The GnuTest failures look like a hiccup with |
nope, I think the test doesn't work on windows. see:
it times out |
🤦 I guess I could have looked myself! Sorry about that. I think I understand what the failure is, and it shouldn't be too hard to fix in theory. I'm currently in the middle of getting VirtualBox running so I can actually work on the thing in Windows.... |
The Settings object did not need a QuotingStyle member, it was basically a static.
print_stats will now take advantage of the buffer built into io::stdout(). We can also waste fewer lines on show! by making a helper macro.
WcError should not hold Strings which are clones of `&'static str`s. thiserror provides a convenient API to declare errors with their Display messages. wc quotes the --files0-from source slightly differently when reporting errors about empty file names.
Sadly ilog10 isn't available until we use 1.67, but we can get close in the meantime.
My previous commits meant to bring our wc's output and behavior in line with GNU's. There should be tests that check for these changes! I found a stupid bug in my own changes, I was not adding 1 to the indexes produced by .enumerate() when printing errors.
GNU testsuite comparison:
|
Woo! Looks like I fixed the Windows stuff. That was a.... pain. VirtualBox didn't work as it is wont to do. I ended up using a free t2.micro on AWS, probably the first time I've used Windows since I almost threw a Windows 7 laptop out of my window because it was so bloody slow and awful! Anyway, in the interest of preventing any individual commits entering the tree which fail CI on their own, I patched the last two here and rebased them all again. You can see the diffs here: https://github.com/jeddenlea/coreutils/compare/old_wc..wc |
impressive, well done :) |
My original focus was on
--files0-from
, which should be handled as a stream if the file specified (or stdin) is not a regular file. I've accomplished this by making a newInputs
type to represent the desired list of files to process. Itstry_iter
method does most of the heavy lifting to figure out how to actually run.Secondarily, I have attempted to reduce the number of String allocations that occur while printing the results. Now, unless a file name needs escaping, or an error occurs, no additional allocations should be necessary to print results. An
Input
now borrows file names from the command line, unless they're read from--files0-from
.print_stats
was the biggest abuser, allocating aString
for each column before before beingjoin
'd into aString
for the whole line. TheTitledWordCount
type is not necessary at all, which was another source ofString
allocation.Finally, I've made some effort to make more cases match GNU wc's output. Errors encountered processing
--files0-from
as well as any encountered processing files listed on the command line will be escaped more consistently like GNU wc. File names printed to the right of their stats will be less aggressively quoted, to match GNU wc, now only if they are not UTF-8 or if they contain a newline.