Include symlinks #261
I tried experimenting with symlinks in With the changes in this branch:
On the
The results are the same, which is not what I expected. I expected the sym-linked (Note: this tests sym-linking to a file in the directory being archived. I believe there would be differences if we sym-linked to a file in a different directory). |
The final "zstash ls --hpss=zstash_archive" (and "zstash ls --hpss=zstash_archive2") does not reveal whether the content exists for the files named. Use ordinary "ls -l" to ensure they are as expected. Also, the symlink "zstash_demo/file2.txt" points to a file "zstash_demo/file0.txt" that is present in the archive anyway. There is no opportunity for that link to be broken.

The issue does not involve empty files. There is nothing (necessarily) wrong with archiving a file with no content. The problem involves (somehow) archiving a symlink to a file that does not exist, either before the archive is created, or afterwards WHEN the path to the real file was NOT already part of the archive (and will not be an available path where the archive is extracted).

One scenario is to zstash a directory of regular files that contains a broken link: add a link-file to a real-file in a separate directory, then delete that real-file BEFORE making the archive. What happens when zstash tries to archive the directory containing a broken link? Does it complain about the broken link? Does it quietly include the broken link in the archive?

Another scenario is to leave the real-file (in a separate directory) intact when forming the zstash archive, then destroy the real file and attempt to zstash extract. This (I think) would simulate taking the zstash archive to a new file-system where the "symlink" (link-file) should no longer work. Was the sym-link exercised (is the real-file present in the archive)? Or is it now just a broken link?

As you mentioned under "issues 247", you could call tarinfo.issym (https://docs.python.org/3/library/tarfile.html#tarfile.TarInfo) to see if a file is a symlink. If so, replace "link-file" with os.path.realpath(link-file) (https://docs.python.org/3/library/os.path.html).
The remaining issue would be to ensure that a file "link-named" as "FILE_A", that was a symlink to a distant-path to "FILE_B", not only gets copied over, but remains named FILE_A and is not tarred as "distant_path/FILE_B" (the value returned by "realpath"). |
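For illustration, here is a minimal sketch of that idea using only the Python standard library. It is not taken from the zstash code base: the helper name `add_following_symlink` is hypothetical, and the sketch assumes the caller passes a relative path to a single file. The target's content is read via `os.path.realpath`, while `arcname` keeps the link's own name (FILE_A) in the tar file.

```python
import os
import tarfile


def add_following_symlink(tar: tarfile.TarFile, path: str) -> None:
    """Add `path` to `tar` (hypothetical helper, not part of zstash).

    If `path` is a symlink, the target's content is stored under the
    link's own name, so the member stays named like the link.
    """
    if os.path.islink(path):
        target = os.path.realpath(path)
        # os.path.exists() is False when the resolved target is missing.
        if not os.path.exists(target):
            raise FileNotFoundError(f"broken symlink: {path} -> {target}")
        # Data and metadata come from the target; the name comes from the link.
        tar.add(target, arcname=path)
    else:
        tar.add(path)
```

The standard library also offers `tarfile.open(..., dereference=True)`, which follows links when adding files; whether that option or a manual check like the one above suits zstash better is a design question this sketch does not settle.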
In both cases,
It appears to just work somehow? On both branches (
This also appears to work. On both branches, |
Are you sure that is the actual file, and not just a "link-name"? That should be impossible. If I create a symlink to a file in a distant directory, then delete that target file, the symlink becomes broken. It cannot be used to access the missing data. If you then zstash the directory containing the symlink, *something* may end up in the archive, but it cannot be the target file that does not exist.

We have a zstash archive: /p/user_pub/e3sm/archive/2_0/DECK-v2/v2.NARRM.amip_0201/ If I conduct "zstash ls -l --hpss=none archive/atm/hist/v2.NARRM.amip_0201.eam.h0.*.nc" what I see includes:
Somehow, the sim-team managed to create a 0-length file (archive/atm/hist/v2.NARRM.amip_0201.eam.h0.1952-12.nc) in the archive. I don't know how they did it, but there must be a way to avoid this - throw up a warning or exception at archive-time. When it is discovered only months later, it leads to far more recovery work. I will try to develop a test regime to duplicate the process. But your example ("appears to just work somehow?") demonstrates that zstash is actually NOT working. Otherwise, zstash is just inventing files that do not exist. |
The commands I ran were essentially the same as above, but with the following differences:
Is there a way to determine that?
Yes, that would be great. Thank you! |
If you delete the file "non_archived/file0.txt" BEFORE creating the archive, the link "zstash_demo/file2.txt" should be broken. If you extract the file "zstash_demo/file2.txt" from the later completed archive, and then issue "cat zstash_demo/file2.txt", you should get an error. You should not get 'file0 stuff' because the actual file no longer existed when the archive was created. Otherwise - magic is present... P.S. Echo more text into file0 before starting: echo 'file0_supercalifragilisticexpialadocious_stuff' > non_archived/file0.txt |
Hmm |
Then perhaps the archive was merely being "appended to" instead of being "recreated"? Here is my test:
(for some reason, I cannot read the "index.db" file.)
At least, there were no errors . . . I am not very practiced at making archives. Can you try this (make an archive of that "sourcedata" directory?) |
Addendum: Here is "zstash ls -l"
(Edit): Also:
|
Ah, so it turns out I was using Updated results (
Unfortunately, this doesn't appear to illuminate much. I'm not seeing why the results are the same in all 3 cases.... I'll keep debugging and try out your test commands as well. |
Also, I am only using "zstash . . . --hpss=none", which is not what the simulation teams will use when creating their archives. But (in principle) that should not matter with regard to treatment of broken symlinks, etc. (Perhaps different execution paths are exercised?) One way or another, broken links are being archived - and we need to prevent it (detect early). |
@TonyB9000 I tried using |
I do all of my work on acme1, and run all jobs from the mounted /p/user_pub/ space. |
I don't have access to I'm assuming the following behavior is what we would actually want after extraction:
Is that right? |
Not quite: On the "delete before" test, I do NOT get a zstash create failure. What I get:
If I examine the raw tar file that zstash created, it contains:
if I use zstash to extract all content of the archive, I get:
As you can see - nothing was extracted. But zstash is happy. Addendum: Here is "zstash ls -l"
(Edit): Also:
|
@TonyB9000 Sorry, in the table above, I meant that those are the results we would want to see (i.e., not what the current implementation of Also, it looks like you're running
|
Actually, when I create a zstash archive, like
I create a "Holodeck" directory elsewhere, as follows:
and I "reside" in Holodeck (above the "zstash" directory) when issuing zstash commands. I employ this "Holodeck" method to insulate myself from the fact that, depending on parameters, zstash would delete the very archive from which files are extracted. When HPSS is present, the tape-extracted tar-files are only temporary. But for our purposes, they ARE the archives. This way, at worst, only the symlinks are deleted. |
@TonyB9000 I updated the run script and results: Updated run script:
Note that Updated results (
|
Results of running
|
The results we would actually want from
That is, because we'd be making hard copies of the sym links, we'd always expect to have the data available. |
@TonyB9000 Results of running cat on the sym-links, in the zstash_extraction directory, using the latest commit:
Case | main file3 | main file4 | n247-symlinks file3 | n247-symlinks file4 |
---|---|---|---|---|
Don't delete original file | file1 stuff | file2 stuff | file1 stuff | file2 stuff |
Delete before create | cat: file3.txt: No such file or directory | cat: file4.txt: No such file or directory | cat: file3.txt: No such file or directory | cat: file4.txt: No such file or directory |
Delete after create | cat: file3.txt: No such file or directory | cat: file4.txt: No such file or directory | file1 stuff | file2 stuff |
I think that's actually what we want, right? If the file being linked to is deleted before zstash create, there's nothing we can do in that command to get it back.
If this is the correct behavior, I just need to make a command line flag to use the old (current) behavior instead.
FYI: If I employ the name "TOP" to mean the full path to the test directory (zstash_diagnostic), and run "setup.sh XXX", the following is created:
In short, your "file3" and "file4" are always symlinks to "file1" and "file2", respectively. Let me refer to your table by its rows: Row 2: It is not enough that we can't recover a file that was not (could not be) archived. We must abort the archiving itself, so we do not produce an archive that has broken links as content. What does "zstash ls -l *" say about the archive? Do the "broken" links appear at all? There should be a flag that allows "broken links" to be archived, if needed. Row 3: That is the expected behavior. The archive contains the files themselves (existing at archive-creation time, but named according to the symlink names), and not merely archiving the symlinks. |
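As a rough illustration of the "abort the archiving itself" request for Row 2, the following sketch scans a directory tree for dangling symlinks before any archive is written. The function name `find_broken_symlinks` is hypothetical and is not part of zstash; it relies only on the fact that `os.path.exists` follows symlinks and therefore returns False for a link whose target is missing.

```python
import os
from typing import List


def find_broken_symlinks(top: str) -> List[str]:
    """Return the sorted paths of dangling symlinks under `top`.

    Hypothetical pre-archive check; not part of zstash.
    """
    broken = []
    for dirpath, dirnames, filenames in os.walk(top):
        # Dangling links are reported among the non-directory entries, but
        # checking both lists also covers links that point at directories.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)
```

A caller could raise an error when the returned list is non-empty, and skip the check when a flag that permits broken links is given.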
Good to know, thanks.
Yes, that's how I set up the initial conditions.
I think this does do that by propagation of the
We can't get there because the archive doesn't get created. Instead, there's a confusing
Awesome. |
FYI, I just discovered you can use
since the "-n" omits the newline. |
I think the behavior for "--hpss=<archive_name>" and "--hpss=none" should be distinct. Error messages for one are not always appropriate for the other. But if there is no archive found with a given name, then "zstash ls" and "zstash extract" should say so, and not say "Error transferring file from HPSS". That would be weird. If I issue in bash, "cp fileA fileB" and fileA does not exist, I don't want to see "Error copying fileA to fileB, FileA may not exist". I want to see "Error: 'FileA' not found". |
@TonyB9000 Results from latest changes, on With HPSS:
Without HPSS:
Ok, I think the "Without HPSS" cases are functioning as we want, correct? (Although I suppose I should change the error message for As for the "With HPSS?" cases, it looks like |
This looks great Ryan! This behavior should prevent "bad archives" from slipping out into the open. For "zstash ls", I agree it should behave like (bash) ls. Programmers (like me) will sometimes use "filecount=`ls dir | wc -l`", and expect filecount to be 0 when no files (no "lines") are returned. |
@TonyB9000 Results from latest changes, on I decided to make the new behavior (rather than the behavior in I think I've copied everything over correctly. We're looking at 12 cases here - [using With
With
Without
Without
|
zstash/ls.py
Outdated
# Ignore failed commands
# If the match list is empty, then `zstash ls` will display nothing,
# just as `ls` does.
return []
I don't think this is getting used. The "nothing to ls" message still gets printed in some cases.
The changes look good to me, although I am personally unfamiliar with using zstash with HPSS, in either "create" or "extract" (or for that matter, "ls") modes. (Graphical/pictorial representations of steps depicting tape-archive, local caches, final destinations, etc with create and extract would probably be useful someday, for HPSS and non-HPSS usages.) When you say the "new behavior" will be the command-line option, does this apply only to create? I am not familiar with the "short-term-archiving" process (post simulation-runs), so I don't know where this would be applied. Would I need to apply the command-line option when perusing or extracting from "caches" that are our "zstash archives"? |
@forsyth2 and @TonyB9000: great new feature, thanks for working on it. I would suggest reconsidering the naming of the command line option:
|
I second the motion. --dereference (or --follow-symlinks) would be better than "--copy-symlinks". It should be clear in the usage statement that a failure to dereference a symlink will result in a failed archive. I just hope that folks won't avoid it on that account. I'd prefer not to receive broken archives... |
I created #262 for this.
It will be a command-line option for
Short-term archiving is independent of, and done well before, any calls to
See my second comment above.
Sure, I can change
I can add that as well. |
c952c91
to
400be37
Compare
Ok, I think this is ready to merge. New behavior is as follows: With
With
Without
Without
|
Looks good to me Ryan - Excellent work and thorough coverage. I do wonder how it will run for me (zstash ls) on an archive with broken symlinks created using the older zstash. Of course, "zstash ls -l" always worked correctly (demonstrated file length 0, md5 hash = None), but that was an inconvenient way to discover broken links. Ideally, the new "zstash ls --follow-symlinks filename" on an old archive should not list "filename" if it is a broken symlink (i.e. just a symlink, "not followed"). I guess this begs the question: If people create new archives without "--follow-symlinks" (original behavior), and I receive the "cache" as a non-HPSS archive, and "file" is a broken symlink, will the new "zstash ls file --follow-symlinks" simply list the file as if present? Or will it behave differently... |
Thanks @TonyB9000!
Ah, so this does not exist. The
If the archive is created without
|
OK, understood. It would probably be a hassle for every "zstash ls" to do an internal "ls -l" and test for "md5=None" or "len=0", which was the only way I confirmed I was looking at a broken link. I can probably use (and parse) "zstash ls -l" instead of just calling "zstash ls" to test for file existing/location stuff. |
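As an alternative to parsing "zstash ls -l" output, one could inspect the raw tar files directly, as was done earlier in this thread. The sketch below assumes the tar files in the cache are ordinary tar files readable by Python's `tarfile` module; the function name `list_symlink_members` is hypothetical. It reports members that were stored as symlinks (i.e. not followed at archive time). It cannot tell whether a link's target will exist after extraction.

```python
import tarfile
from typing import List, Tuple


def list_symlink_members(tar_path: str) -> List[Tuple[str, str]]:
    """Return (member name, link target) for each symlink stored in the tar.

    Hypothetical helper; assumes `tar_path` is a plain tar file.
    """
    with tarfile.open(tar_path, "r") as tar:
        # TarInfo.issym() is True for members stored as symbolic links.
        return [(m.name, m.linkname) for m in tar.getmembers() if m.issym()]
```

Any member listed here is a candidate for a broken link on a new file-system, so it is worth checking before relying on the archive.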
Sounds good. If it turns out to be too much of a problem, we can create a new issue to handle that. |
Include symlinks. Resolves #247