Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[develop] Add logic to assert that the output of the diff command must do not contain lines starting with "Only in /tmp/home" #2748

Merged
merged 16 commits into from
Jun 18, 2024

Conversation

hehe7318
Copy link
Contributor

@hehe7318 hehe7318 commented Jun 10, 2024

Description of changes

  • The test_shared_home test is failing caused by integrity check process in restore_home_shared_data.rb before delete the backup directory. Now fix the test.
    • Add logic to assert that the output of the diff command must do not contain lines starting with "Only in /tmp/home"
    • This fix addresses issues where the /home directory contains pre-existing files such as slurm-*.out generated by running SLURM jobs, the automatically created lost+found directory and etc.
  • Apply the same logic to restore_internal_use_shared_data.rb and config_default_user_home.rb to avoid any potential risks. Also modified their spec test to adopt the new changes.
  • We decided to keep node['cluster']['cluster_user_home'] to avoid any potential file system data loss risks.

Tests

  • Failed test_shared_home(("FsxOpenZfs", "Efs"), ("FsxOntap", "Efs"), ("Ebs", "Efs")) integ tests can now successfully pass

References

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

hehe7318 added 3 commits June 9, 2024 01:56
…g data sync and integrity check

- Add logic to generate a list of existing files in /home before performing the rsync operation.
- Ensure that files and directories present in /home but not in /tmp/home are properly excluded during the diff check.
- Handle cases where the same filename exists in both /home and /tmp/home but with different content, by including these files in the exclude list.
- Update the rsync and diff commands to ensure accurate synchronization and verification of data integrity.
- This fix addresses issues where the /home directory contains pre-existing files such as slurm-*.out generated by running SLURM jobs and the automatically created lost+found directory.
@hehe7318 hehe7318 requested review from a team as code owners June 10, 2024 15:42
@hehe7318 hehe7318 added the skip-recursive-deletion-check Skip the checks regarding the use of recursive deletion. label Jun 10, 2024
@hehe7318 hehe7318 force-pushed the wip/fix-inte-test-shared-home branch from ba0dfe8 to db3c722 Compare June 11, 2024 17:25
hehe7318 added 2 commits June 14, 2024 14:56
…d of using overcomplicating loop. Apply the same logic to restore_internal_use_shared_data.rb and config_default_user_home.rb
@hehe7318 hehe7318 changed the title [develop] Exclude existing files and directories in /home during data sync and integrity check [develop] Add logic to assert that the output of the diff command must do not contain lines starting with "Only in /tmp/home" Jun 17, 2024
@@ -29,11 +29,13 @@
expect(chef_run).to run_bash("Verify data integrity for #{user_home}").with(
code: <<-CODE
diff_output=$(diff -r #{user_home} #{user_local_home})
if [ $? -eq 0 ]; then
if [[ $diff_output != *"Only in #{user_home}"* ]]; then
rm -rf /tmp#{user_home}
rm -rf #{user_home}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I'm just noticing this rm -rf #{user_home}. This is very risky and should never be done.
We should only delete temporary folders.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had a new user_local_home. Why we still need user_home which has the data user_local_home already have? This rm -rf #{user_home} is the existing logic before I created the data integrity check PR. And we have used it for a long time. So it should be okay.

Also, it's not a recipe, it's a spec test.

My concern is maybe the temp folder is useless. In this case, it's not like the restore_home_shared_data.rb, in restore_home_shared_data.rb we replace /home dir, so we need a temp folder to store data. But in this case, we can directly sync data from #{node['cluster']['cluster_user_home'] to #{node['cluster']['cluster_user_local_home']. But also the temp folder is the existing logic before I created the data integrity check PR.

Copy link
Contributor

@gmarciani gmarciani Jun 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had a new user_local_home. Why we still need user_home which has the data user_local_home already have?

We should avoid to do rm -rf on home directories because we do not want to take any risk of data loss, regardless if we need the data or not. We should not take the responsibility of removing /home data.

What if, for whatever race condition, a remote file system is still attached to the original home directory?
In that case that rm -rf would remove everything from the remote file system and we would cause data loss.
We can say that there exist already a mitigation to this case: this recipe is executed before the recipes that mount the remote file systems, so we may be confident that at this point of the execution, no custom data exist yet in the home directories.

This is a good mitigation, but it's still not enough:

  1. What if there exist a corner case where a remote file system is mounted before it?
    A very improbable, but still possible, example may be a custom AMI with a init.d script that mounts a remote file system. I agree that it's really really improbable, is the value worth the risk?
  2. Furthermore, at some point we may reorder our recipes forgetting about the requirement for this recipe to be executed at the beginning, making the recursive deletion a potential source of data loss.

To sum up, we have a 99% mitigation of the data loss risk, but the potential cost of that 1% is disruptive.

In general, we should reason in terms of cost/opportunity.

  1. What the worst case cost of doing rm -rf? In the worst case it causes data loss, so potentially millions of dollars in data and reputation.
  2. What's the cost of keeping the original home directory? A potential confusion in the user that may not understand why the original home is kept. This would cost barely nothing to the cx and only potential tickets to us that could be resolved with documentation.

Also, it's not a recipe, it's a spec test.

A spec test typically reflects what the recipe is doing. If it does reflect it, then my comment applies to both the test and the recipe. If the test does not reflect the recipe, then we have a problem in the test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that explain my confusion, thank you Giacomo. I totally agree with the potential risks you mentioned.

Furthermore, at some point we may reorder our recipes forgetting about the requirement for this recipe to be executed at the beginning, making the recursive deletion a potential source of data loss.

You are correct! Can not agree more.

A spec test typically reflects what the recipe is doing. If it does reflect it, then my comment applies to both the test and the recipe. If the test does not reflect the recipe, then we have a problem in the test.

I know! I was just trying to say we should look at the recipe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done and add a comment for it.

@hehe7318 hehe7318 enabled auto-merge (squash) June 18, 2024 15:23
@hehe7318 hehe7318 merged commit a3cc026 into aws:develop Jun 18, 2024
30 of 31 checks passed
hgreebe pushed a commit to hgreebe/aws-parallelcluster-cookbook that referenced this pull request Aug 22, 2024
…t do not contain lines starting with "Only in /tmp/home" (aws#2748)

Fix test_shared_home integ test failure
- The `test_shared_home` test is failing caused by integrity check process in `restore_home_shared_data.rb` before delete the backup directory. Now fix the test.
    - Add logic to assert that the output of the `diff` command must do not contain lines starting with "Only in /tmp/home"
    - This fix addresses issues where the filesystem contains pre-existing files
- Apply the same logic to `restore_internal_use_shared_data.rb` and `config_default_user_home.rb` to avoid any potential risks. Also modified their spec test to adopt the new changes
- We decided to keep node['cluster']['cluster_user_home'] to avoid any potential file system data loss risks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.x skip-changelog-update skip-recursive-deletion-check Skip the checks regarding the use of recursive deletion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants