Add --include option #199

Closed
golaz opened this issue Feb 23, 2022 · 2 comments · Fixed by #264

Comments

@golaz
Collaborator

golaz commented Feb 23, 2022

In some cases, it would be useful to have an --include option in addition to the existing --exclude option. If both options are specified at the same time, start with all files in --include and remove files as specified in --exclude.
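The proposed precedence (keep what --include matches, then drop what --exclude matches) could be sketched roughly as below. This is only an illustration: select_files and _matches_any are hypothetical names, not zstash functions, and the sketch assumes both options take a comma-separated list of shell-style patterns matched with fnmatch.

```python
from fnmatch import fnmatch
from typing import List, Optional


def _matches_any(path: str, patterns: List[str]) -> bool:
    # True if the path matches at least one shell-style pattern.
    return any(fnmatch(path, pattern) for pattern in patterns)


def _split(patterns: str) -> List[str]:
    # Assumption: patterns arrive as one comma-separated string.
    return [p.strip() for p in patterns.split(",") if p.strip()]


def select_files(
    files: List[str],
    include: Optional[str] = None,
    exclude: Optional[str] = None,
) -> List[str]:
    # Hypothetical helper, not part of zstash.
    selected = files
    if include is not None:
        # Start from only the files that --include matches.
        inc = _split(include)
        selected = [f for f in selected if _matches_any(f, inc)]
    if exclude is not None:
        # Then remove anything that --exclude matches.
        exc = _split(exclude)
        selected = [f for f in selected if not _matches_any(f, exc)]
    return selected


files = ["a/x.nc", "a/y.txt", "b/x.nc"]
print(select_files(files, include="*.nc", exclude="b/*"))
```

With neither option set, the list is returned unchanged, so the current behavior would be preserved by default.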

@forsyth2
Collaborator

From @PeterCaldwell in #248:

zstash create --exclude is very handy. It would be nice to also have a zstash create --include option.

For example, for the 50 TB per simulated month SCREAM runs I'm doing, I want to be able to zstash all the "T_2m" files together without also having to zstash a bunch of other files (which are individually 200 GB to 2 TB in size).

@forsyth2
Collaborator

It looks like this is the function we'd need to change: https://github.com/E3SM-Project/zstash/blob/main/zstash/utils.py#L58

def get_files_to_archive(cache: str, exclude: str) -> List[str]:
    # List of files
    logger.info("Gathering list of files to archive")
    # Tuples of the form (path, filename)
    file_tuples: List[Tuple[str, str]] = []
    # Walk the current directory
    for root, dirnames, filenames in os.walk("."):
        if not dirnames and not filenames:
            # There are no subdirectories nor are there files.
            # This directory is empty.
            file_tuples.append((root, ""))
        for filename in filenames:
            # Loop over files
            # filenames is a list, so if it is empty, no looping will occur.
            file_tuples.append((root, filename))

    # Sort first on directories (x[0])
    # Further sort on filenames (x[1])
    file_tuples = sorted(file_tuples, key=lambda x: (x[0], x[1]))

    # Relative file paths, excluding the cache
    files: List[str] = [
        os.path.normpath(os.path.join(x[0], x[1]))
        for x in file_tuples
        if x[0] != os.path.join(".", cache)
    ]

    # Eliminate files based on exclude pattern
    if exclude is not None:
        files = exclude_files(exclude, files)

    return files

The function always walks the current directory with os.walk("."), so perhaps that call just needs to become os.walk(included_files), where included_files is the value given to --include. I'm not quite sure how parsing will work out for 1) individual files versus directories and 2) wildcards like "*".
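On the two open questions: os.walk yields nothing when given the path of a regular file, and it does not expand wildcards, so swapping the walk root alone would not cover either case. One possible approach, sketched here under assumptions, is to expand the value with glob first and then walk only the matches that are directories. gather_included is a hypothetical name, not a zstash function, and unlike the function above this sketch does not record empty directories or skip the cache.

```python
import glob
import os
from typing import List


def gather_included(include_pattern: str) -> List[str]:
    """Return sorted relative file paths matched by one include pattern."""
    # Hypothetical helper, not part of zstash.
    found = set()
    # glob expands wildcards; a literal path matches itself if it exists.
    for match in glob.glob(include_pattern):
        if os.path.isdir(match):
            # Directories are walked recursively, as in the existing code.
            for root, _dirnames, filenames in os.walk(match):
                for filename in filenames:
                    found.add(os.path.normpath(os.path.join(root, filename)))
        else:
            # Individual files are added directly, since os.walk skips them.
            found.add(os.path.normpath(match))
    return sorted(found)
```

For example, gather_included("run") would return every file under a run directory, while gather_included("*.txt") would return only the matching files in the current directory.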

@forsyth2 forsyth2 moved this from Todo to In Progress in forsyth2 current tasks May 12, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in forsyth2 current tasks May 16, 2023