-
-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
split: correct filename creation algorithm #2859
split: correct filename creation algorithm #2859
Conversation
Wow! Great work on figuring this out! That's a really intriguing algorithm. At first, it indeed seemed really unintuitive, but I feel like it make sense looking at it iteratively. So I'm wondering whether it might make sense to embrace that iterative nature and build the next prefix from the last one. For example, I think this works correctly too: fn main() {
// The characters in the prefix, but reversed.
let mut prefix = vec![b'a', b'a'];
let mut zs = String::new();
let mut len = 2;
print_str(&zs, &prefix);
for _ in 1..26*26 {
for c in prefix.iter_mut() {
*c += 1;
// If we reach the character after 'z', we want to wrap back to 'a' and carry the increment to the next char
// else we can stop.
if *c > b'z' {
*c = 'a';
} else {
break;
}
}
// We have reached the final name for this length ("za", "zaa", etc.)
// So set it to 'a' and append an 'a'. Also add a 'z' to the front.
if prefix[len-1] == b'z' {
prefix[len-1] = b'a';
zs.push('z');
prefix.push(b'a');
len += 1;
}
print_str(&zs, &prefix);
}
}
// This should be implemented by writing the bytes directly to stdout.
fn print_str(zs: &str, prefix: &[u8]) {
println!(
"{}{}",
zs,
String::from_utf8(prefix.iter().map(|&x| x).rev().collect::<Vec<_>>()).unwrap()
)
} (I also put this in a Rust Playground link, if you want to test it). We could then make this into an iterator of filenames in your Factory struct. Granted, it's not immediately obvious how it works, but it sidesteps some of the mathematical reasoning necessary in your approach. What do you think? |
Yes, there's something to that idea. My approach of computing the filename directly from the chunk index is more general than necessary, since we always visit chunks in order (that is, index 0, then 1, then 2, etc.). So using an iterator and just computing the successor to the current filename seems sensible. Thanks for coming up with the code for it, the output looks correct to me. I can work on implementing this as an iterator, but in the meantime don't hesitate to merge this as-is if you decide it is acceptable, since it may take me some time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright! This is indeed too good not to merge. Just a few small nits.
//! Create filenames of the form `chunk_??.txt`: | ||
//! | ||
//! ```rust,ignore | ||
//! use crate::filenames::FilenameFactory; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can get rid of the ignore
, here and in the other doctests. If we mark filenames
as pub mod filenames;
in split.rs
and change this import to use uu_split::filenames::FilenameFactory;
. It's not really necessary, though, because you already made so many tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's good to know. I don't think it's worth exposing the module as public just to get these doctests to work, so I'll leave it for now. Thanks for informing me.
Fix two issues with the filename creation algorithm. First, this corrects the behavior of the `-a` option. This commit ensures a failure occurs when the number of chunks exceeds the number of filenames representable with the specified fixed width: $ printf "%0.sa" {1..11} | split -d -b 1 -a 1 split: output file suffixes exhausted Second, this corrects the behavior of the default behavior when `-a` is not specified on the command line. Previously, it was always settings the filenames to have length 2 suffixes. This commit corrects the behavior to follow the algorithm implied by GNU split, where the filename lengths grow dynamically by two characters once the number of chunks grows sufficiently large: $ printf "%0.sa" {1..91} | ./target/debug/coreutils split -d -b 1 \ > && ls x* | tail x81 x82 x83 x84 x85 x86 x87 x88 x89 x9000
1c2fda0
to
cfe5a0d
Compare
I have made the requested change and rebased on master. |
(Sorry for the large diff here. I tried to find a way to minimize the changes but there were several coupled changes and the algorithm is very unintuitive. I tried to add plenty of comments to help make sense of it.)
Fix two issues with the filename creation algorithm. First, this
corrects the behavior of the
-a
option. This commit ensures afailure occurs when the number of chunks exceeds the number of
filenames representable with the specified fixed width:
Second, this corrects the behavior of the default behavior when
-a
is not specified on the command line. Previously, it was always
settings the filenames to have length 2 suffixes. This commit corrects
the behavior to follow the algorithm implied by GNU split, where the
filename lengths grow dynamically by two characters once the number of
chunks grows sufficiently large: