Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

split: correct filename creation algorithm #2859

Merged
merged 2 commits into from
Jan 14, 2022

Conversation

jfinkels
Copy link
Collaborator

@jfinkels jfinkels commented Jan 9, 2022

(Sorry for the large diff here. I tried to find a way to minimize the changes but there were several coupled changes and the algorithm is very unintuitive. I tried to add plenty of comments to help make sense of it.)

Fix two issues with the filename creation algorithm. First, this
corrects the behavior of the -a option. This commit ensures a
failure occurs when the number of chunks exceeds the number of
filenames representable with the specified fixed width:

$ printf "%0.sa" {1..11} | split -d -b 1 -a 1
split: output file suffixes exhausted

Second, this corrects the behavior of the default behavior when -a
is not specified on the command line. Previously, it was always
settings the filenames to have length 2 suffixes. This commit corrects
the behavior to follow the algorithm implied by GNU split, where the
filename lengths grow dynamically by two characters once the number of
chunks grows sufficiently large:

$ printf "%0.sa" {1..91} | ./target/debug/coreutils split -d -b 1 \
>   && ls x* | tail
x81
x82
x83
x84
x85
x86
x87
x88
x89
x9000

@tertsdiepraam
Copy link
Member

Wow! Great work on figuring this out! That's a really intriguing algorithm. At first, it indeed seemed really unintuitive, but I feel like it make sense looking at it iteratively. So I'm wondering whether it might make sense to embrace that iterative nature and build the next prefix from the last one. For example, I think this works correctly too:

fn main() {
    // The characters in the prefix, but reversed.
    let mut prefix = vec![b'a', b'a'];
    let mut zs = String::new();
    let mut len = 2;
    
    print_str(&zs, &prefix);
    for _ in 1..26*26 {
        for c in prefix.iter_mut() {
            *c += 1;
            // If we reach the character after 'z', we want to wrap back to 'a' and carry the increment to the next char
            // else we can stop.
            if *c > b'z' {
                *c = 'a';
            } else {
                break;
            }
        }
        // We have reached the final name for this length ("za", "zaa", etc.)
        // So set it to 'a' and append an 'a'. Also add a 'z' to the front.
        if prefix[len-1] == b'z' {
            prefix[len-1] = b'a';
            zs.push('z');
            prefix.push(b'a');
            len += 1;
        }
        print_str(&zs, &prefix);
    }
}

// This should be implemented by writing the bytes directly to stdout.
fn print_str(zs: &str, prefix: &[u8]) {
    println!(
        "{}{}",
        zs,
        String::from_utf8(prefix.iter().map(|&x| x).rev().collect::<Vec<_>>()).unwrap()
    )
}

(I also put this in a Rust Playground link, if you want to test it). We could then make this into an iterator of filenames in your Factory struct.

Granted, it's not immediately obvious how it works, but it sidesteps some of the mathematical reasoning necessary in your approach. What do you think?

@jfinkels
Copy link
Collaborator Author

jfinkels commented Jan 9, 2022

Yes, there's something to that idea. My approach of computing the filename directly from the chunk index is more general than necessary, since we always visit chunks in order (that is, index 0, then 1, then 2, etc.). So using an iterator and just computing the successor to the current filename seems sensible. Thanks for coming up with the code for it, the output looks correct to me.

I can work on implementing this as an iterator, but in the meantime don't hesitate to merge this as-is if you decide it is acceptable, since it may take me some time.

Copy link
Member

@tertsdiepraam tertsdiepraam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright! This is indeed too good not to merge. Just a few small nits.

//! Create filenames of the form `chunk_??.txt`:
//!
//! ```rust,ignore
//! use crate::filenames::FilenameFactory;
Copy link
Member

@tertsdiepraam tertsdiepraam Jan 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can get rid of the ignore, here and in the other doctests. If we mark filenames as pub mod filenames; in split.rs and change this import to use uu_split::filenames::FilenameFactory;. It's not really necessary, though, because you already made so many tests.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's good to know. I don't think it's worth exposing the module as public just to get these doctests to work, so I'll leave it for now. Thanks for informing me.

src/uu/split/src/filenames.rs Outdated Show resolved Hide resolved
Fix two issues with the filename creation algorithm. First, this
corrects the behavior of the `-a` option. This commit ensures a
failure occurs when the number of chunks exceeds the number of
filenames representable with the specified fixed width:

    $ printf "%0.sa" {1..11} | split -d -b 1 -a 1
    split: output file suffixes exhausted

Second, this corrects the behavior of the default behavior when `-a`
is not specified on the command line. Previously, it was always
settings the filenames to have length 2 suffixes. This commit corrects
the behavior to follow the algorithm implied by GNU split, where the
filename lengths grow dynamically by two characters once the number of
chunks grows sufficiently large:

    $ printf "%0.sa" {1..91} | ./target/debug/coreutils split -d -b 1 \
    >   && ls x* | tail
    x81
    x82
    x83
    x84
    x85
    x86
    x87
    x88
    x89
    x9000
@jfinkels jfinkels force-pushed the split-dynamic-suffix-length branch from 1c2fda0 to cfe5a0d Compare January 11, 2022 01:44
@jfinkels
Copy link
Collaborator Author

I have made the requested change and rebased on master.

@sylvestre sylvestre merged commit 3cc1fb5 into uutils:master Jan 14, 2022
@jfinkels jfinkels deleted the split-dynamic-suffix-length branch January 15, 2022 00:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants