Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AiiDA configuration file sometimes gets erased. #4604

Closed
yakutovicha opened this issue Dec 1, 2020 · 24 comments
Closed

AiiDA configuration file sometimes gets erased. #4604

yakutovicha opened this issue Dec 1, 2020 · 24 comments
Assignees
Labels
Milestone

Comments

@yakutovicha
Copy link
Contributor

Describe the bug

The aiida configuration file sometimes gets erased when the machine where AiiDA is running is overloaded. More specifically, if one calls verdi and then, while it is running, stops it using the Ctrl+C there is a high risk of losing the config file.

Additional note: every time the verdi command is called, the config file timestamp changes, which suggests that the file gets overwritten.

Expected behaviour

The verdi command should not change the config file unless it is verdi config or a version update.

@ltalirz
Copy link
Member

ltalirz commented Dec 1, 2020

The problem is this line, which results in the configuration file being overwritten whenever it is successfully read using Config.from_file:

This change was introduced in 2f9224a (February 2020).

@greschd asked whether the addition of config.store() was necessary for tests to pass #3797 (review)
@sphuber can you comment?

If it was necessary for tests to pass, perhaps the tests need to be amended?
In any case, I think this line storing the config should be removed again.

@greschd
Copy link
Member

greschd commented Dec 1, 2020

Might be related to #4523, insofar as that can lead to "inconsistent" config states - and maybe that leads to storing it over and over again (just a hunch, could be wrong).

@chrisjsewell
Copy link
Member

It looks like it only needs to be stored if config_needs_migrating(config) is True?

@sphuber
Copy link
Contributor

sphuber commented Dec 1, 2020

If it was necessary for tests to pass, perhaps the tests need to be amended?
In any case, I think this line storing the config should be removed again.

No the reason is that the line before it migrates the config if necessary, but only in memory. The file on disk will remain in its old form. The line was added to persist the changes to disk. Otherwise the migration would be performed each time the config file was read. What we should do instead is to have the writing to disk only happen if an actual migration was performed.

I am not convinced however that this is the source of the bug described by @yakutovicha . The problem you describe is merely the explanation for why the config file is written each time it is read. This doesn't necessarily explain how it can be accidentally deleted.

@chrisjsewell
Copy link
Member

I thinks its quite probably the source of the bug. As already mentioned, it may be something in shutil.copy whereby if it is interrupted it leads to the loss of the file.

Additionally, this

if md5_from_filelike(handle) != md5_file(self.filepath):
self._backup(self.filepath)
shutil.copy(handle.name, self.filepath)

should be changed to

            if md5_from_filelike(handle) != md5_file(self.filepath):
                self._backup(self.filepath)
                shutil.copy(handle.name, self.filepath)

i.e. you only need to perform the copy if the hashes are different

@ltalirz
Copy link
Member

ltalirz commented Dec 1, 2020

What we should do instead is to have the writing to disk only happen if an actual migration was performed.

Ok, could you prepare a PR for this?

@ltalirz
Copy link
Member

ltalirz commented Dec 1, 2020

i.e. you only need to perform the copy if the hashes are different

While this is technically true, people might expect the file to be written if they call store() (even if the content doesn't change).
Anyhow, I think this is more a matter of taste, I think both approaches are fine.

@sphuber
Copy link
Contributor

sphuber commented Dec 1, 2020

I have opened #4605 to make sure the config only gets written to disk after loading if a migration was performed. This solves one problem, but like I said before, this is not the actual cause of the bug reported here. Of course having many more unnecessary writings of the file increases the chance of the bug happening but the actual cause has to be sought elsewhere.

A good candidate is the use of shutil.copy which is not actually atomic. Instead we should use os.rename to move the temp file into place. This should be an atomic operation. I am not sure if the use of shutil.copy could actually explain the bug reported though. Is it possible that if the copy action is interrupted the file is deleted entirely?

@ltalirz
Copy link
Member

ltalirz commented Dec 1, 2020

The aiida configuration file sometimes gets erased when the machine where AiiDA is running is overloaded

@yakutovicha Could you perhaps clarify what "erased" means here?
Do you mean that the config file was deleted or do you mean that it was replaced by an "empty" config file?

If it is the latter, the problem might instead be in the rather broad except that leads to the config file being overwritten here:

try:
with open(filepath, 'r', encoding='utf8') as handle:
config = json.load(handle)
except (IOError, OSError):
config = Config(filepath, check_and_migrate_config({}))
config.store()

P.S. We might need an "exponential backoff mechanism" for reading the config file ;-D

@sphuber
Copy link
Contributor

sphuber commented Dec 2, 2020

the rather broad except that leads to the config file being overwritten here

I wonder why you think this is a broad exception being caught? This is the standard way of catching a file that does not exist or cannot be read. Do you mean that instead of also catching the latter case, we should only be checking if the file does not already exist? There we will have similar problems with multiple processes racing to be the first to create it.

@ltalirz
Copy link
Member

ltalirz commented Dec 2, 2020

Do you mean that instead of also catching the latter case, we should only be checking if the file does not already exist?

Maybe yes?
If the file exists, but is not readable at one point in time, is the right solution really to overwrite it with an empty file?

@sphuber
Copy link
Contributor

sphuber commented Dec 2, 2020

Maybe it is indeed better to change this to FileNotFoundError. I am working on another PR that also improves the writing of the files to try and make it as atomic as possible.

@chrisjsewell
Copy link
Member

to try and make it as atomic as possible.

Randomly I was just reading this atomic_moves function in ansible, which may or may not be of interest:

https://github.com/ansible/ansible/blob/478b2687ec76f1403a4c8ffadeeab235e1281213/lib/ansible/module_utils/basic.py#L2340

@sphuber
Copy link
Contributor

sphuber commented Dec 2, 2020

Unless we really think this is necessary, I think it is worth to try and avoid having to make ansible a dependency just to have this atomic move function.

@chrisjsewell
Copy link
Member

Unless we really think this is necessary, I think it is worth to try and avoid having to make ansible a dependency just to have this atomic move function.

Yeh I wasn't suggesting making a dependency lol, it was just to note an example of how others have tackled this atomicity

@sphuber
Copy link
Contributor

sphuber commented Dec 2, 2020

Yeh I wasn't suggesting making a dependency lol, it was just to note an example of how others have tackled this atomicity

I see, thanks for the link 👍 Taking a look at the implementation does make an important point though: doing this properly across-platforms is not trivial. So it does not make sense to pretend we can implement a quick version ourselves that covers all corner cases. The question is what level of security we need though.

I was thinking of the simple following algorithm for writing the content of src to dst in a somewhat robust atomic way:

  1. Determine a temporary filename, in the same directory as where dst is to be written, that does not yet exist
  2. Write src to the temporary file and make sure it is flushed and synced
  3. Use os.rename to move dst.tmp to dst

I think this hopefully would safeguard us from many but not all potential problems. If the current writing caused the bug reported here, this change may solve that, but I don't think we can say that for sure.

@sphuber
Copy link
Contributor

sphuber commented Dec 2, 2020

I have opened a PR linked above that makes the writing to disk of the configuration file more atomic. As mentioned, I am not sure that this solves this bug. @yakutovicha how often did the bug occur? Would it be feasible to use this branch in your environment to see if it still happens? Proving a negative is difficult but I am not sure how else to test this or to confirm whether this PR can close this issue.

@yakutovicha
Copy link
Contributor Author

@yakutovicha how often did the bug occur?

@csadorf once had it. Additionally, when running a course for 14 groups (14 AiiDAlab accounts) I had two cases in 4 days.

This is all I can say now.

@sphuber I am not sure you've already taken care of it, but just in case: did you make sure the timestamp doesn't change when you run verdi?

@sphuber
Copy link
Contributor

sphuber commented Dec 2, 2020

I am not sure you've already taken care of it, but just in case: did you make sure the timestamp doesn't change when you run verdi?

Yes, this was already fixed in PR #4605 and will be released with 1.5.1

@yakutovicha could you still answer the question what you mean with "erased": is the entire file removed or is the content simply partially or fully removed?

@yakutovicha
Copy link
Contributor Author

@yakutovicha could you still answer the question what you mean with "erased": is the entire file removed or is the content simply partially or fully removed?

ah, sorry, missed that question. In my case, the file remained there, but it was empty.

@chrisjsewell
Copy link
Member

Oh that certainly makes sense that it could be shutil.copy then (overwriting rather than completely deleting) @ramirezfranciscof was this the same for you?

@ltalirz
Copy link
Member

ltalirz commented Dec 2, 2020

In my case, the file remained there, but it was empty.

And just to be sure: when you say "empty" do you mean "completely empty" or "a configuration file with no configured profiles"?

@yakutovicha
Copy link
Contributor Author

In my case, the file remained there, but it was empty.

And just to be sure: when you say "empty" do you mean "completely empty" or "a configuration file with no configured profiles"?

extremely empty.

@sphuber sphuber added this to the v1.5.1 milestone Dec 2, 2020
@sphuber
Copy link
Contributor

sphuber commented Dec 2, 2020

This was potentially addressed by #4607 so closing for now. Please reopen if the problem persists

@sphuber sphuber closed this as completed Dec 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants