Common "footguns" when writing CLI applications #29
I'm not sure that in the year 2018 anyone should not be expecting UTF-8 input 🤔
@spacekookie If your tool is designed to process arbitrary files, and that processing can be done in a somewhat encoding-agnostic way (for example, one might only need the assumption that the text is ASCII compatible), then you really can't use `String`. On Windows this gets a bit more complicated. If you write a CLI tool for Windows that reads files, then at some point you're likely to come across UTF-16. This can be handled automatically via BOM sniffing and transcoding, but there's no out-of-the-box convenient abstraction to handle this.
Sounds like a good candidate for any interested party in the WG to take on.
For streaming transcoding, see:
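As a hedged sketch of what BOM-sniffed streaming transcoding can look like, here is one way to do it with the `encoding_rs_io` crate; the crate choice and the file name are assumptions on my part, not something the thread specifies:

```rust
use std::fs::File;
use std::io::{self, Read};

use encoding_rs_io::DecodeReaderBytes;

fn main() -> io::Result<()> {
    // Assumed input file; it could be UTF-8, UTF-16LE, or UTF-16BE.
    let file = File::open("input.txt")?;
    // DecodeReaderBytes sniffs a BOM and transcodes to UTF-8 on the fly;
    // input without a BOM is passed through as-is.
    let mut reader = DecodeReaderBytes::new(file);
    let mut contents = String::new();
    reader.read_to_string(&mut contents)?;
    println!("read {} UTF-8 bytes", contents.len());
    Ok(())
}
```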
The `String` issue is, in my view, huge. If your CLI app takes a filename but can't handle my non-UTF-8-but-perfectly-valid argument, for example, that's a bug. And in general, non-`String`-but-text data (`[u8]`, `OsString`) feels like a second-class citizen in Rust. Most libraries/APIs/macros require `String`, such as `std::fmt` (including `Debug` and `Display`). And putting unchecked data into a `String` is considered UB. Ideally, an overarching trait or similar could be created to expand existing systems to the non-UTF-8 story. I'm sure that's not a 2018 possibility though, and I would be interested to see what the WG suggests in the meantime. (Side note, since the OP mentioned the "ideal guide" issue: the quicli docs are mentioned as a "concrete example" there. Their front-page example has `file: String`.)
@jeekobu An RFC was recently merged that adds a `Pattern` API for `OsStr`. I think fixing the "programs demand UTF-8, even for file paths" bug is the realm of argv parsers. They should make it very easy to handle file paths correctly.
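A minimal sketch of handling a file-path argument without the UTF-8 footgun, using `std::env::args_os` rather than `std::env::args`; the `tool <FILE>` CLI shape is just an illustrative assumption:

```rust
use std::env;
use std::path::PathBuf;
use std::process;

fn main() {
    // args_os() yields OsString values and never panics on non-UTF-8
    // arguments, unlike args(). PathBuf wraps OsString, so no lossy
    // UTF-8 conversion is required anywhere on the path's way to open().
    let path: PathBuf = match env::args_os().nth(1) {
        Some(arg) => PathBuf::from(arg),
        None => {
            eprintln!("usage: tool <FILE>");
            process::exit(2);
        }
    };
    // display() is lossy only for printing; the underlying path is intact.
    println!("opening {}", path.display());
}
```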
The `Pattern` extension is why I have some hope that quality `&[u8]` and/or `OsStr` support is a future possibility. Anyway, after I made that comment, I found #26, which more directly addresses this hurdle.
Unfortunately I managed to shoot myself in the foot regarding reading lines from stdin; see https://www.reddit.com/r/rust/comments/aaood3/go_version_of_program_is_40_quicker_than_rust/ and the follow-up gist at https://gist.github.com/djhworld/ced7eabca8e0c06d5b65917e87fb8745. I think it would be great if the CLI book could at least document a recommended way of reading newline-delimited input from stdin to avoid falling into these traps, or at least have an example application that does something trivial like converting input to uppercase.
@djhworld Very good point! I have a TODO open for stdin in #95. I'm not entirely sure what the book should recommend precisely; I'd say using a buffered reader is uncontroversial. Reusing the line buffer, however, requires you to be very careful: you must always clear it, and you can't expect to store its contents permanently without cloning. Do you think that reading from stdin using a buffered reader will often be a performance issue? Or would other parts of the program usually be vastly slower?
I think putting this in the book, in almost these exact words, would be super helpful. Coupled with an example it would probably help people a lot.
I've seen the issue of folks unknowingly re-acquiring a lock on stdin over and over come up several times. It seems that at least for trivial benchmarks and exploratory work this is an actual issue. I suspect this might also be an actual bottleneck for people building filtering shell tools.
WG is the acronym for ...?
Working Group
Is there any recommended pattern or best practice for writing a piping CLI that efficiently reads from and writes to stdin/stdout? Even having a reference implementation to look at would be useful.
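A minimal sketch of such a filter, along the lines suggested above (buffered and locked handles, a reused line buffer); the uppercase transform is just the trivial example @djhworld asked for, not a recommendation from the thread:

```rust
use std::io::{self, BufRead, Write};

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let mut input = stdin.lock(); // lock once, not per read
    let stdout = io::stdout();
    let mut output = io::BufWriter::new(stdout.lock()); // lock once, buffer writes

    let mut line = String::new();
    loop {
        line.clear(); // reuse the allocation; forgetting this appends lines
        if input.read_line(&mut line)? == 0 {
            break; // 0 bytes read means EOF
        }
        output.write_all(line.to_uppercase().as_bytes())?;
    }
    output.flush()
}
```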
It seems like we're coming across a few small "footguns", some common, some less so, so it may be nice to collect them in one place so that when writing something like #6 we don't forget anything.
This could happen in #6, but I'd prefer if it were separate so this can dive a little deeper and actually make a list.
Off the top of my head from stuff that's come up today in the gitter and elsewhere:
- Reading `stdin` line by line is slow, because it does a heap alloc at each line. Easiest solution is `BufReader::read_line`
- Using `String` as an input type for arbitrary data that doesn't necessarily need to be UTF-8
- Using `println!` when writing lots of small output, which locks `stdout` at each call. Prefer locking `stdout` manually and using something like `write!` (see the sketch after this list)
- Errors when closing `stdout` going unnoticed (the fix is calling `libc::close` manually, and checking for errors)
- `println!` can panic on a broken pipe
- `std::env::args` panics at invalid UTF-8 (the fix is to use `std::env::args_os`, but this is not very intuitive, and with issues like "Ergonomics of String handling" #26 it's far more friction than it should be)

If you've got more I'll add them. Of course I think we should also be open to discussion about the ones I listed as to why they should, or should not be listed and/or other fixes for them. 😄
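A hedged sketch combining two of the items above: locking `stdout` once instead of per `println!`, and treating a broken pipe as a clean exit rather than a panic. The loop body and iteration count are placeholders:

```rust
use std::io::{self, Write};
use std::process;

fn main() {
    let stdout = io::stdout();
    // Lock once up front instead of once per println! call,
    // and buffer the writes on top of the lock.
    let mut out = io::BufWriter::new(stdout.lock());
    for i in 0..1_000_000u32 {
        // writeln! returns an Err on a broken pipe (e.g. `tool | head`)
        // instead of panicking the way println! does.
        if let Err(e) = writeln!(out, "{}", i) {
            if e.kind() == io::ErrorKind::BrokenPipe {
                return; // downstream closed the pipe; exit quietly
            }
            eprintln!("write error: {}", e);
            process::exit(1);
        }
    }
    // Flush explicitly so buffered-write errors aren't silently dropped.
    if let Err(e) = out.flush() {
        eprintln!("flush error: {}", e);
        process::exit(1);
    }
}
```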
Gitter log of reading `stdin` line by line
@rcoh 18:37: Hello! I'm working on https://github.com/rcoh/angle-grinder. In the process, I noticed that reading from stdin via .lines() is pretty slow. There are ways around it, of course. ripgrep uses a custom-written stdin buffer
@BurntSushi 18:38: yeah it creates an alloc for each new line. ripgrep's buffer is quite a bit more involved. :grinning: easiest path to something fast is BufReader::read_line
@rcoh 18:38: In any case, I was wondering if there were any plans to make reading lots of data from stdin a more generally supported pattern or if ripgrep's pattern could be abstracted into a library in some way
@BurntSushi 18:39: ripgrep doesn't really read line-by-line, so it's a bit different
@rcoh 18:40: but I assume at some point in the internals, it's splitting things into lines?
@BurntSushi 18:40: nope
@rcoh 18:40: oh interesting!
@BurntSushi 18:40: only for output
@rcoh 18:40: doesn't grep have line-by-line semantics? Does ripgrep as well?
@BurntSushi 18:41: in the (very common) case that searching a file yields no matches, it is possible that the concept of a "line" never actually materializes at all
@rcoh 18:41: interesting.
@BurntSushi 18:41: see: https://blog.burntsushi.net/ripgrep/#mechanics (GNU grep does the same thing)
@rcoh 18:42: My current approach is:
Is that the fastest I can do without resorting to a totally different strategy?
@BurntSushi 18:42: pretty much, yes. if you could use a Vec instead of a String, then you can use read_until and avoid the UTF-8 validation check. depends on what you're doing though!
if you're processing data from arbitrary files, then String is very likely the wrong thing to use
@rcoh 18:44: it's intended to process log files so it's almost always printable data except in some weird cases
@BurntSushi 18:44: it all depends on whether you're OK with it choking on non-UTF-8 encoded text
@rcoh 18:53: Got it. Thanks for your help! I literally couldn't think of a more qualified person to have answer my question :grinning:
@BurntSushi 18:53: :grinning: FWIW, I do intend on splitting the core ripgrep search code out into the grep crate, which will include the buffer handling. it's likely possible that the API will be flexible enough to do line-by-line parsing as fast as possible (although, avoiding line-by-line will still of course be faster for many workloads)
@rcoh 19:02: Cool. I don't actually need to be line-by-line until after running some filtering so that would be useful for my use case.
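As a hedged illustration of @BurntSushi's `read_until` suggestion in the log above, here is a loop over raw bytes that skips UTF-8 validation entirely; the line-counting workload is a placeholder:

```rust
use std::io::{self, BufRead};

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let mut reader = stdin.lock();
    let mut buf = Vec::new();
    let mut lines = 0u64;
    // read_until splits on b'\n' into a Vec<u8>, so unlike read_line
    // it never validates UTF-8 and never chokes on arbitrary bytes.
    while reader.read_until(b'\n', &mut buf)? != 0 {
        lines += 1;
        buf.clear(); // reuse the allocation across iterations
    }
    println!("{}", lines);
    Ok(())
}
```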