[develop] Add logic to assert that the output of the diff command must do not contain lines starting with "Only in /tmp/home" #2748
…g data sync and integrity check
- Add logic to generate a list of existing files in /home before performing the rsync operation.
- Ensure that files and directories present in /home but not in /tmp/home are properly excluded during the diff check.
- Handle cases where the same filename exists in both /home and /tmp/home but with different content, by including these files in the exclude list.
- Update the rsync and diff commands to ensure accurate synchronization and verification of data integrity.
- This fix addresses issues where the /home directory contains pre-existing files such as slurm-*.out generated by running SLURM jobs and the automatically created lost+found directory.
…me before the sync not only the files
ba0dfe8 to db3c722
cookbooks/aws-parallelcluster-environment/recipes/init/restore_home_shared_data.rb
…. 2. Minor changes of comments.
…options.txt not found error
…d of using overcomplicating loop. Apply the same logic to restore_internal_use_shared_data.rb and config_default_user_home.rb
cookbooks/aws-parallelcluster-environment/spec/unit/recipes/config_default_user_home_spec.rb
cookbooks/aws-parallelcluster-environment/recipes/init/config_default_user_home.rb
@@ -29,11 +29,13 @@
expect(chef_run).to run_bash("Verify data integrity for #{user_home}").with(
code: <<-CODE
diff_output=$(diff -r #{user_home} #{user_local_home})
if [ $? -eq 0 ]; then
if [[ $diff_output != *"Only in #{user_home}"* ]]; then
rm -rf /tmp#{user_home}
rm -rf #{user_home}
Also, I'm just noticing this `rm -rf #{user_home}`. This is very risky and should never be done.
We should only delete temporary folders.
We now have a new `user_local_home`. Why do we still need `user_home`, which holds data that `user_local_home` already has? This `rm -rf #{user_home}` was already in place before I created the data integrity check PR, and we have used it for a long time, so it should be okay.
Also, it's not a recipe, it's a spec test.
My concern is that the temp folder may be unnecessary. This case is not like `restore_home_shared_data.rb`: there we replace the /home dir, so we need a temp folder to store the data. Here we can sync data directly from `#{node['cluster']['cluster_user_home']}` to `#{node['cluster']['cluster_user_local_home']}`. That said, the temp folder also predates the data integrity check PR.
> We now have a new `user_local_home`. Why do we still need `user_home`, which holds data that `user_local_home` already has?
We should avoid running `rm -rf` on home directories because we do not want to take any risk of data loss, regardless of whether we need the data or not. We should not take responsibility for removing /home data.
What if, because of some race condition, a remote file system is still attached to the original home directory?
In that case, `rm -rf` would remove everything from the remote file system and we would cause data loss.
One could argue that a mitigation already exists: this recipe is executed before the recipes that mount the remote file systems, so we can be confident that, at this point of the execution, no custom data exists yet in the home directories.
This is a good mitigation, but it's still not enough:
- What if there is a corner case where a remote file system is mounted before it? A very improbable, but still possible, example is a custom AMI with an init.d script that mounts a remote file system. I agree that it's really improbable, but is the value worth the risk?
- Furthermore, at some point we may reorder our recipes forgetting about the requirement for this recipe to be executed at the beginning, making the recursive deletion a potential source of data loss.
To sum up, we have a 99% mitigation of the data loss risk, but the potential cost of the remaining 1% is disruptive.
In general, we should reason in terms of cost/opportunity.
- What's the worst-case cost of doing `rm -rf`? In the worst case it causes data loss, so potentially millions of dollars in data and reputation.
- What's the cost of keeping the original home directory? Potential confusion for users who may not understand why the original home is kept. This would cost the customer almost nothing, and would cost us only potential tickets that could be resolved with documentation.
> Also, it's not a recipe, it's a spec test.
A spec test typically reflects what the recipe is doing. If it does reflect it, then my comment applies to both the test and the recipe. If the test does not reflect the recipe, then we have a problem in the test.
Yes, that explains my confusion. Thank you, Giacomo. I totally agree with the potential risks you mentioned.
> Furthermore, at some point we may reorder our recipes forgetting about the requirement for this recipe to be executed at the beginning, making the recursive deletion a potential source of data loss.
You are correct! I couldn't agree more.
> A spec test typically reflects what the recipe is doing. If it does reflect it, then my comment applies to both the test and the recipe. If the test does not reflect the recipe, then we have a problem in the test.
I know! I was just trying to say that we should look at the recipe.
Done, and I added a comment for it.
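The outcome of this thread (keep the original home directory, delete only the temporary copy) can be illustrated with a small guard. This is a minimal POSIX shell sketch under stated assumptions, not the code from the recipe: `remove_temp_backup` and the demo paths are hypothetical names, and the check is a plain string match on the path, so it does not resolve symlinks.

```shell
#!/bin/sh
# Hypothetical helper (not part of the cookbook): delete a directory only
# when its path sits strictly below an expected temporary root.
remove_temp_backup() {
    dir=$1
    temp_root=$2
    case $dir in
        */..|*/../*)
            # Reject paths that could climb out of the temporary root.
            echo "refusing to delete $dir: path contains '..'" >&2
            return 1 ;;
        "$temp_root"/?*)
            rm -rf -- "$dir" ;;
        *)
            echo "refusing to delete $dir: not under $temp_root" >&2
            return 1 ;;
    esac
}

# Demo in a scratch area, so no real /home or /tmp/home is touched.
work=$(mktemp -d)
mkdir -p "$work/tmp/home/user" "$work/home/user"

# The temporary copy is removed.
remove_temp_backup "$work/tmp/home" "$work/tmp"
# The stand-in for the original home directory is refused and kept.
remove_temp_backup "$work/home" "$work/tmp" 2>/dev/null || echo "home directory kept"

# Record the resulting state before cleaning up the scratch area.
if [ -d "$work/tmp/home" ]; then temp_state=present; else temp_state=removed; fi
if [ -d "$work/home/user" ]; then home_state=present; else home_state=removed; fi
rm -rf -- "$work"
```

The point of the design is that the deletion path never receives a home directory at all: even if recipes are reordered or a file system is mounted earlier than expected, the worst case is a leftover directory rather than lost data.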
…e system data loss risks.
…t do not contain lines starting with "Only in /tmp/home" (aws#2748)
Fix test_shared_home integ test failure
- The `test_shared_home` test was failing because of the integrity check that `restore_home_shared_data.rb` runs before deleting the backup directory. This change fixes the test.
- Add logic to assert that the output of the `diff` command does not contain lines starting with "Only in /tmp/home".
- This fix addresses issues where the filesystem contains pre-existing files.
- Apply the same logic to `restore_internal_use_shared_data.rb` and `config_default_user_home.rb` to avoid any potential risks, and update their spec tests accordingly.
- We decided to keep `node['cluster']['cluster_user_home']` to avoid any potential file system data loss risks.
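The difference between checking the exit status of `diff -r` and checking its output for "Only in <backup>" can be shown with a short sketch. It is written in POSIX shell under stated assumptions: `backup_fully_restored` and the scratch directories are hypothetical, `LC_ALL=C` is added here so that the message text is not localized, and the match is a substring test like the one in the spec above. Like that test, it does not flag files that exist on both sides with different content.

```shell
#!/bin/sh
# Hypothetical helper (not part of the cookbook): succeed when no entry
# exists only under the backup directory ($1) after syncing to the target ($2).
backup_fully_restored() {
    backup_dir=$1
    target_dir=$2
    # diff exits non-zero for ANY difference, including extra files in the
    # target, so the output is inspected instead of the exit status.
    diff_output=$(LC_ALL=C diff -r "$backup_dir" "$target_dir")
    case $diff_output in
        *"Only in $backup_dir"*) return 1 ;;
        *) return 0 ;;
    esac
}

# Demo in a scratch area; these directories stand in for /tmp/home and /home.
work=$(mktemp -d)
mkdir -p "$work/backup" "$work/target/lost+found"
echo data > "$work/backup/a.txt"
echo data > "$work/target/a.txt"
echo log > "$work/target/slurm-1.out"

# Case 1: the target has pre-existing extras, but nothing from the backup is missing.
if backup_fully_restored "$work/backup" "$work/target"; then
    extras_result=restored
else
    extras_result=incomplete
fi

# Case 2: a file in the backup never reached the target.
echo more > "$work/backup/b.txt"
if backup_fully_restored "$work/backup" "$work/target"; then
    missing_result=restored
else
    missing_result=incomplete
fi

rm -rf -- "$work"
echo "extras only: $extras_result, missing file: $missing_result"
```

In the first case an exit-status check would have reported a failure because of `slurm-1.out` and `lost+found`, which is the situation described for the `test_shared_home` test.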
Description of changes
- The `test_shared_home` test was failing because of the integrity check that `restore_home_shared_data.rb` runs before deleting the backup directory. This change fixes the test.
- Add logic to assert that the output of the `diff` command does not contain lines starting with "Only in /tmp/home".
- Apply the same logic to `restore_internal_use_shared_data.rb` and `config_default_user_home.rb` to avoid any potential risks, and update their spec tests accordingly.
Tests
- The `test_shared_home` (("FsxOpenZfs", "Efs"), ("FsxOntap", "Efs"), ("Ebs", "Efs")) integ tests now pass successfully.
References
Checklist
- … `develop` add the branch name as prefix in the PR title (e.g. `[release-3.6]`).
- Please review the guidelines for contributing and Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.