[develop] Add logic to assert that the output of the diff command must not contain lines starting with "Only in /tmp/home" #2748

Merged Jun 18, 2024 (16 commits)
Changes from 10 commits
@@ -63,19 +63,21 @@
   end

 # Data integrity check and cleanup for temporary backup and original home directory
-# 1. Verifies data integrity by comparing the temporary backup directory and the new local home directory.
+# 1. Verifies data integrity by comparing the original home directory and the new local home directory.
 # 2. If the data integrity check passes, it removes both the temporary backup directory and the original home directory.
 # 3. If the data integrity check fails, it outputs an error message and exits with an error code 1.
 bash "Verify data integrity for #{node['cluster']['cluster_user_home']}" do
   user 'root'
   group 'root'
   code <<-EOH
     diff_output=$(diff -r #{node['cluster']['cluster_user_home']} #{node['cluster']['cluster_user_local_home']})
-    if [ $? -eq 0 ]; then
+    if [[ $diff_output != *"Only in #{node['cluster']['cluster_user_home']}"* ]]; then
       rm -rf /tmp#{node['cluster']['cluster_user_home']}
       rm -rf #{node['cluster']['cluster_user_home']}
     else
-      echo "Data integrity check failed comparing #{node['cluster']['cluster_user_local_home']} and #{node['cluster']['cluster_user_home']}: $diff_output" >&2
+      only_in_cluster_user_home=$(echo "$diff_output" | grep "Only in #{node['cluster']['cluster_user_home']}")
+      echo "Data integrity check failed comparing #{node['cluster']['cluster_user_local_home']} and #{node['cluster']['cluster_user_home']}. Differences:"
+      echo "$only_in_cluster_user_home"
       systemctl start sshd
       exit 1
     fi
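The change above replaces an exit-status check with a pattern match on diff's output. With `diff_output=$(diff -r A B)`, `$?` does hold diff's exit status, but diff exits nonzero on *any* difference, while the recipe only wants to fail when a file exists in the original home and is missing from the new one. A minimal sketch of the new check outside Chef, with throwaway mktemp directories standing in for the node attributes:

```shell
#!/usr/bin/env bash
# Minimal sketch of the new integrity check, outside Chef. The mktemp paths
# stand in for the recipe's cluster_user_home / cluster_user_local_home.
orig=$(mktemp -d)        # original home directory
local_home=$(mktemp -d)  # new local home directory

echo "data" > "$orig/shared.txt"
cp "$orig/shared.txt" "$local_home/shared.txt"
echo "extra" > "$local_home/only_local.txt"  # extra files in the NEW home are tolerated

diff_output=$(diff -r "$orig" "$local_home")

# diff exits nonzero here (the trees differ), so an exit-status check would
# fail; the pattern match only fails when a file exists in the original home
# but is missing from the new one ("Only in $orig" lines).
if [[ $diff_output != *"Only in $orig"* ]]; then
  result="passed"
else
  result="failed"
fi
echo "integrity check $result"
```

Here the check passes even though the trees differ, because the only asymmetry is a file present solely in the new home.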
@@ -19,7 +19,7 @@
 # This is necessary to preserve any data in these directories that was
 # generated during the build of ParallelCluster AMIs after converting to
 # shared storage and backed up to a temporary location previously
-# Before removing the backup, ensure the data in the new home is the same
+# Before removing the backup, ensure the new data in the new home is the same
 # as the original to avoid any data loss or inconsistency. This is done
 # by using rsync to copy the data and diff to check for differences.
 # Remove the backup after the copy is done and the data integrity is verified.
@@ -29,10 +29,12 @@
   code <<-EOH
     rsync -a --ignore-existing /tmp/home/ /home
     diff_output=$(diff -r /tmp/home/ /home)
-    if [ $? -eq 0 ]; then
+    if [[ $diff_output != *"Only in /tmp/home"* ]]; then
       rm -rf /tmp/home/
     else
-      echo "Data integrity check failed comparing /home and /tmp/home: $diff_output"
+      only_in_tmp=$(echo "$diff_output" | grep "Only in /tmp/home")
+      echo "Data integrity check failed comparing /home and /tmp/home. Differences:"
+      echo "$only_in_tmp"
       exit 1
     fi
   EOH
@@ -19,7 +19,7 @@
 # This is necessary to preserve any data in these directories that was
 # generated during the build of ParallelCluster AMIs after converting to
 # shared storage and backed up to a temporary location previously
-# Before removing the backup, ensure the data in the new directory is the same
+# Before removing the backup, ensure the new data in the new directory is the same
 # as the original to avoid any data loss or inconsistency. This is done
 # by using rsync to copy the data and diff to check for differences.
 # Remove the backup after the copy is done and the data integrity is verified.
@@ -30,10 +30,12 @@
   code <<-EOH
     rsync -a --ignore-existing /tmp#{dir}/ #{dir}
     diff_output=$(diff -r /tmp#{dir}/ #{dir})
-    if [ $? -eq 0 ]; then
+    if [[ $diff_output != *"Only in /tmp#{dir}"* ]]; then
      rm -rf /tmp#{dir}/
     else
-      echo "Data integrity check failed comparing #{dir} and /tmp#{dir}: $diff_output"
+      only_in_tmp=$(echo "$diff_output" | grep "Only in /tmp#{dir}")
+      echo "Data integrity check failed comparing #{dir} and /tmp#{dir}. Differences:"
+      echo "$only_in_tmp"
       exit 1
     fi
   EOH
@@ -29,11 +29,13 @@
 expect(chef_run).to run_bash("Verify data integrity for #{user_home}").with(
   code: <<-CODE
     diff_output=$(diff -r #{user_home} #{user_local_home})
-    if [ $? -eq 0 ]; then
+    if [[ $diff_output != *"Only in #{user_home}"* ]]; then
       rm -rf /tmp#{user_home}
       rm -rf #{user_home}
Contributor (gmarciani):

Also I'm just noticing this `rm -rf #{user_home}`. This is very risky and should never be done. We should only delete temporary folders.

Contributor Author:

We had a new user_local_home. Why do we still need user_home, which has the data user_local_home already has? This `rm -rf #{user_home}` is existing logic from before I created the data integrity check PR, and we have used it for a long time, so it should be okay.

Also, it's not a recipe, it's a spec test.

My concern is that the temp folder may be useless. This case is not like restore_home_shared_data.rb: there we replace the /home dir, so we need a temp folder to store the data, but here we could sync data directly from #{node['cluster']['cluster_user_home']} to #{node['cluster']['cluster_user_local_home']}. Then again, the temp folder is also existing logic from before the data integrity check PR.

Contributor (@gmarciani, Jun 17, 2024):

> We had a new user_local_home. Why do we still need user_home, which has the data user_local_home already has?

We should avoid doing `rm -rf` on home directories because we do not want to take any risk of data loss, regardless of whether we need the data or not. We should not take the responsibility of removing /home data.

What if, through some race condition, a remote file system is still attached to the original home directory? In that case the `rm -rf` would remove everything from the remote file system and we would cause data loss. One could say a mitigation for this already exists: this recipe is executed before the recipes that mount the remote file systems, so we can be confident that at this point of the execution no custom data exists yet in the home directories.

This is a good mitigation, but it's still not enough:

  1. What if there exists a corner case where a remote file system is mounted before it? A very improbable, but still possible, example is a custom AMI with an init.d script that mounts a remote file system. I agree that it's really improbable, but is the value worth the risk?
  2. Furthermore, at some point we may reorder our recipes, forgetting the requirement for this recipe to run at the beginning, making the recursive deletion a potential source of data loss.

To sum up, we have a 99% mitigation of the data loss risk, but the potential cost of that 1% is disruptive.

In general, we should reason in terms of cost/opportunity:

  1. What's the worst-case cost of doing `rm -rf`? In the worst case it causes data loss, so potentially millions of dollars in data and reputation.
  2. What's the cost of keeping the original home directory? Potential confusion for a user who may not understand why the original home is kept. This costs the customer barely anything, and us only potential tickets that could be resolved with documentation.

> Also, it's not a recipe, it's a spec test.

A spec test typically reflects what the recipe is doing. If it does reflect it, then my comment applies to both the test and the recipe. If the test does not reflect the recipe, then we have a problem in the test.

Contributor Author:

Yes, that explains my confusion, thank you Giacomo. I totally agree with the potential risks you mentioned.

> Furthermore, at some point we may reorder our recipes, forgetting the requirement for this recipe to run at the beginning, making the recursive deletion a potential source of data loss.

You are correct! Couldn't agree more.

> A spec test typically reflects what the recipe is doing. If it does reflect it, then my comment applies to both the test and the recipe. If the test does not reflect the recipe, then we have a problem in the test.

I know! I was just trying to say we should look at the recipe.

Contributor Author:

Done, and added a comment for it.
     else
-      echo "Data integrity check failed comparing #{user_local_home} and #{user_home}: $diff_output" >&2
+      only_in_cluster_user_home=$(echo "$diff_output" | grep "Only in #{user_home}")
+      echo "Data integrity check failed comparing #{user_local_home} and #{user_home}. Differences:"
+      echo "$only_in_cluster_user_home"
       systemctl start sshd
       exit 1
     fi
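The review thread's central worry is running `rm -rf` on a directory that might still have a remote file system attached. One mitigation in that spirit (hypothetical, not part of this PR) is to refuse deletion whenever the target is a mount point, using the util-linux `mountpoint` utility:

```shell
# Hypothetical guard (NOT in the PR): never recursively delete a directory
# that is currently a mount point, so an attached remote file system
# (e.g. a home directory backed by NFS/EFS) cannot be wiped by mistake.
safe_rm_dir() {
  local dir="$1"
  if mountpoint -q "$dir"; then
    echo "refusing to delete $dir: it is a mount point" >&2
    return 1
  fi
  rm -rf -- "$dir"
}

# Usage: deleting a plain temporary directory succeeds as usual.
tmp=$(mktemp -d)
safe_rm_dir "$tmp" && echo "deleted $tmp"
```

This would not remove the need to order recipes carefully, but it turns the "remote file system still mounted" corner case from silent data loss into a loud failure.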
@@ -37,10 +37,12 @@
   code: <<-CODE
     rsync -a --ignore-existing /tmp#{dir}/ #{dir}
     diff_output=$(diff -r /tmp#{dir}/ #{dir})
-    if [ $? -eq 0 ]; then
+    if [[ $diff_output != *"Only in /tmp#{dir}"* ]]; then
       rm -rf /tmp#{dir}/
     else
-      echo "Data integrity check failed comparing #{dir} and /tmp#{dir}: $diff_output"
+      only_in_tmp=$(echo "$diff_output" | grep "Only in /tmp#{dir}")
+      echo "Data integrity check failed comparing #{dir} and /tmp#{dir}. Differences:"
+      echo "$only_in_tmp"
       exit 1
     fi
   CODE