split: correct filename creation algorithm #2859

jfinkels · 2022-01-09T02:57:24Z

(Sorry for the large diff here. I tried to find a way to minimize the changes but there were several coupled changes and the algorithm is very unintuitive. I tried to add plenty of comments to help make sense of it.)

Fix two issues with the filename creation algorithm. First, this
corrects the behavior of the -a option. This commit ensures a
failure occurs when the number of chunks exceeds the number of
filenames representable with the specified fixed width:

$ printf "%0.sa" {1..11} | split -d -b 1 -a 1
split: output file suffixes exhausted
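
As a rough illustration of the fixed-width check described above (a sketch only, not the code in this PR; the function name `fixed_width_suffix` and its signature are hypothetical), a width of `w` over an alphabet of `r` symbols can represent `r^w` names, so any chunk index at or beyond that count has no name:

```rust
/// Hypothetical helper, not part of uutils: return the fixed-width suffix
/// for chunk number `index`, or `None` once the suffixes are exhausted.
/// Assumes `alphabet` holds ASCII bytes, e.g. b"0123456789" for `-d`.
fn fixed_width_suffix(index: u64, width: u32, alphabet: &[u8]) -> Option<String> {
    let radix = alphabet.len() as u64;
    if radix == 0 {
        return None;
    }
    // There are radix^width distinct names. If that power overflows u64,
    // the capacity exceeds every possible index, so no check is needed.
    if let Some(capacity) = radix.checked_pow(width) {
        if index >= capacity {
            return None;
        }
    }
    // Write the index in base `radix`, filling from the last position.
    let mut digits = vec![alphabet[0]; width as usize];
    let mut rest = index;
    for slot in digits.iter_mut().rev() {
        *slot = alphabet[(rest % radix) as usize];
        rest /= radix;
    }
    String::from_utf8(digits).ok()
}
```

With `-d -a 1` there are ten names, so the eleventh chunk (index 10) in the example above has no suffix left and the sketch returns `None`.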

Second, this corrects the default behavior when -a
is not specified on the command line. Previously, it always
set the filenames to have length 2 suffixes. This commit corrects
the behavior to follow the algorithm implied by GNU split, where the
filename lengths grow dynamically by two characters once the number of
chunks grows sufficiently large:

$ printf "%0.sa" {1..91} | ./target/debug/coreutils split -d -b 1 \
>   && ls x* | tail
x81
x82
x83
x84
x85
x86
x87
x88
x89
x9000
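
One way to read the dynamic growth shown above is as a sequence of groups: the first group has 2-character suffixes and stops just before the last symbol would appear in the leading position, the next group is prefixed with one copy of the last symbol and uses 3 characters, and so on. The following is a sketch under that reading, not the implementation in this PR; the function name `dynamic_suffix` is hypothetical, and the group sizes are inferred from the example output here rather than taken from the GNU source.

```rust
/// Hypothetical helper, not part of uutils: compute the suffix for chunk
/// number `index` directly, letting the length grow by two characters
/// whenever the current group of names is used up.
/// Assumes `alphabet` holds at least two ASCII bytes.
fn dynamic_suffix(index: u64, alphabet: &[u8]) -> String {
    assert!(alphabet.len() >= 2, "need at least two symbols");
    let radix = alphabet.len() as u64;
    let last = alphabet[alphabet.len() - 1];
    let mut rest = index;
    let mut width: u32 = 2;
    let mut name: Vec<u8> = Vec::new();
    loop {
        // A group of `width`-character names holds
        // radix^(width - 1) * (radix - 1) entries: every combination whose
        // leading symbol is not the last symbol of the alphabet.
        let group = radix
            .checked_pow(width - 1)
            .and_then(|p| p.checked_mul(radix - 1));
        match group {
            Some(size) if rest >= size => {
                // Skip this group: one more leading `last` symbol and
                // one more trailing position.
                rest -= size;
                name.push(last);
                width += 1;
            }
            // Either the index falls in this group, or the group is too
            // large to count in u64 and therefore contains the index.
            _ => break,
        }
    }
    // Write the remaining offset in base `radix` using `width` positions.
    let mut digits = vec![alphabet[0]; width as usize];
    for slot in digits.iter_mut().rev() {
        *slot = alphabet[(rest % radix) as usize];
        rest /= radix;
    }
    name.extend(digits);
    String::from_utf8(name).expect("alphabet is assumed to be ASCII")
}
```

For the decimal alphabet, indices 0 through 89 map to `00` through `89` and index 90 maps to `9000`, which matches the listing above.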

tertsdiepraam · 2022-01-09T14:55:14Z

Wow! Great work on figuring this out! That's a really intriguing algorithm. At first, it indeed seemed really unintuitive, but I feel like it makes sense when you look at it iteratively. So I'm wondering whether it might make sense to embrace that iterative nature and build the next prefix from the last one. For example, I think this works correctly too:

fn main() {
    // The characters in the prefix, but reversed.
    let mut prefix = vec![b'a', b'a'];
    let mut zs = String::new();
    let mut len = 2;
    
    print_str(&zs, &prefix);
    for _ in 1..26*26 {
        for c in prefix.iter_mut() {
            *c += 1;
            // If we reach the character after 'z', we want to wrap back to 'a' and carry the increment to the next char
            // else we can stop.
            if *c > b'z' {
                *c = b'a';
            } else {
                break;
            }
        }
        // We have reached the final name for this length ("za", "zaa", etc.)
        // So set it to 'a' and append an 'a'. Also add a 'z' to the front.
        if prefix[len-1] == b'z' {
            prefix[len-1] = b'a';
            zs.push('z');
            prefix.push(b'a');
            len += 1;
        }
        print_str(&zs, &prefix);
    }
}

// This should be implemented by writing the bytes directly to stdout.
fn print_str(zs: &str, prefix: &[u8]) {
    println!(
        "{}{}",
        zs,
        String::from_utf8(prefix.iter().rev().copied().collect::<Vec<_>>()).unwrap()
    )
}

(I also put this in a Rust Playground link, if you want to test it). We could then make this into an iterator of filenames in your Factory struct.

Granted, it's not immediately obvious how it works, but it sidesteps some of the mathematical reasoning necessary in your approach. What do you think?

jfinkels · 2022-01-09T16:25:38Z

Yes, there's something to that idea. My approach of computing the filename directly from the chunk index is more general than necessary, since we always visit chunks in order (that is, index 0, then 1, then 2, etc.). So using an iterator and just computing the successor to the current filename seems sensible. Thanks for coming up with the code for it, the output looks correct to me.
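
A minimal sketch of what such a successor-based iterator could look like, adapted from the loop in the previous comment. This is not the code that was eventually merged; the type name `DynamicSuffixes` and its fields are hypothetical, and the alphabet is assumed to be ASCII with at least two symbols.

```rust
/// Hypothetical iterator, not part of uutils: yields suffixes in order,
/// computing each one from its predecessor.
struct DynamicSuffixes {
    alphabet: Vec<u8>,
    /// Indices into `alphabet`, least significant position first.
    digits: Vec<usize>,
    /// Number of copies of the last symbol placed in front of the digits.
    lead: usize,
}

impl DynamicSuffixes {
    fn new(alphabet: &[u8]) -> Self {
        assert!(alphabet.len() >= 2, "need at least two symbols");
        Self {
            alphabet: alphabet.to_vec(),
            digits: vec![0, 0],
            lead: 0,
        }
    }
}

impl Iterator for DynamicSuffixes {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        let last = self.alphabet.len() - 1;

        // Render the current name: leading symbols, then the digits in
        // most-significant-first order.
        let mut name = vec![self.alphabet[last]; self.lead];
        name.extend(self.digits.iter().rev().map(|&d| self.alphabet[d]));

        // Advance to the successor: increment with carry.
        for d in self.digits.iter_mut() {
            if *d == last {
                *d = 0;
            } else {
                *d += 1;
                break;
            }
        }

        // If the most significant digit has reached the last symbol, this
        // length is used up: reset it, add a position, and add one more
        // leading symbol.
        let top = self.digits.len() - 1;
        if self.digits[top] == last {
            self.digits[top] = 0;
            self.digits.push(0);
            self.lead += 1;
        }

        // Returns None (ending iteration) only if the alphabet is not UTF-8.
        String::from_utf8(name).ok()
    }
}
```

Because the iterator never ends on its own, a caller would pull exactly one name per chunk, for example with `take(n)` or by calling `next()` each time a new output file is opened.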

I can work on implementing this as an iterator, but in the meantime don't hesitate to merge this as-is if you decide it is acceptable, since it may take me some time.

tertsdiepraam

Alright! This is indeed too good not to merge. Just a few small nits.

tertsdiepraam · 2022-01-09T21:41:45Z

src/uu/split/src/filenames.rs

+//! Create filenames of the form `chunk_??.txt`:
+//!
+//! ```rust,ignore
+//! use crate::filenames::FilenameFactory;


We can get rid of the `ignore`, here and in the other doctests, if we declare the module as `pub mod filenames;` in split.rs and change this import to `use uu_split::filenames::FilenameFactory;`. It's not really necessary, though, because you already made so many tests.

That's good to know. I don't think it's worth exposing the module as public just to get these doctests to work, so I'll leave it for now. Thanks for informing me.


jfinkels · 2022-01-11T01:44:23Z

I have made the requested change and rebased on master.

tertsdiepraam approved these changes Jan 9, 2022


jfinkels added 2 commits January 10, 2022 20:43

split: correct arg parameters for -b option

e5d6b7a

jfinkels force-pushed the split-dynamic-suffix-length branch from 1c2fda0 to cfe5a0d Compare January 11, 2022 01:44

sylvestre merged commit 3cc1fb5 into uutils:master Jan 14, 2022

jfinkels deleted the split-dynamic-suffix-length branch January 15, 2022 00:00

jfinkels mentioned this pull request Jan 15, 2022

split: use iterator to produce filenames #2868

Merged

tertsdiepraam mentioned this pull request Sep 26, 2022

Add support for starting suffix numbers #3976

Merged
