Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reading net charges from outSileSiesta #307

Merged
merged 7 commits into from
Mar 8, 2021

Conversation

pfebrer
Copy link
Contributor

@pfebrer pfebrer commented Feb 18, 2021

Instead of generating Voronoi charges as in my question 2 days ago I decided to let SIESTA do it. I then implemented a method to parse the charges from siesta's output. Is it useful enough to go in?

If you agree this could be useful, we can discuss the API. Right now it doesn't let the user decide which charges they want, it will return different things depending on how often charges have been written. You can find in this zip different output files of a 3 step MD of graphene, and following you can see how the method behaves for the different situations.

Charges at the end of the calculation

sisl.get_sile("charges_final.out").read_net_charges("voronoi")
array([-0.00598,  0.00596])
sisl.get_sile("charges_final.out").read_net_charges("hirshfeld")
array([-0.e+00, -2.e-05])

Charges at every MD step

sisl.get_sile("charges_MD_step.out").read_net_charges("voronoi")
[array([[-0.02935,  0.02934]]),
 array([[-0.01198,  0.01196]]),
 array([[-0.00598,  0.00596]])]
sisl.get_sile("charges_MD_step.out").read_net_charges("hirshfeld")
[array([[-7.e-05,  5.e-05]]),
 array([[-1.e-05, -1.e-05]]),
 array([[-0.e+00, -2.e-05]])]

Charges at every SCF step

sisl.get_sile("charges_scf_step.out").read_net_charges("voronoi")
[array([[-0.03033,  0.03032],
        [-0.02957,  0.02955],
        [-0.02954,  0.02952],
        [-0.02939,  0.02938],
        [-0.02937,  0.02936],
        [-0.02935,  0.02934],
        [-0.02935,  0.02934],
        [-0.02935,  0.02934]]), array([[-0.01225,  0.01223],
        [-0.01199,  0.01197],
        [-0.01198,  0.01196],
        [-0.01197,  0.01195],
        [-0.01198,  0.01196],
        [-0.01198,  0.01196]]), array([[-0.00581,  0.00579],
        [-0.00601,  0.00598],
        [-0.00599,  0.00597],
        [-0.00598,  0.00595],
        [-0.00598,  0.00596],
        [-0.00598,  0.00596]])]
sisl.get_sile("charges_scf_step.out").read_net_charges("hirshfeld")
[array([[-9.e-05,  8.e-05],
        [-5.e-05,  3.e-05],
        [-6.e-05,  5.e-05],
        [-8.e-05,  6.e-05],
        [-7.e-05,  6.e-05],
        [-7.e-05,  5.e-05],
        [-7.e-05,  5.e-05],
        [-7.e-05,  5.e-05]]), array([[-6.e-05,  4.e-05],
        [-2.e-05,  0.e+00],
        [-2.e-05, -0.e+00],
        [-1.e-05, -1.e-05],
        [-1.e-05, -1.e-05],
        [-1.e-05, -1.e-05]]), array([[-7.e-05,  4.e-05],
        [-2.e-05, -1.e-05],
        [-1.e-05, -2.e-05],
        [ 0.e+00, -3.e-05],
        [-0.e+00, -2.e-05],
        [-0.e+00, -2.e-05]])]

Cheers!

@zerothi
Copy link
Owner

zerothi commented Feb 18, 2021

Great, good idea!

  1. We should probably name it read_charge in that way we can also add Mulliken to the list ;)
  2. We need to add Mulliken
  3. Possibly doing something like read_scf with flags to decide what to return (defaults should be similar to that)
  4. The code needs to be adapted to handle non-collinear spin ;)

To make things more easy, you could use the function found, line = self.step_to("siesta: Final energy") which jumps in the file.
This should remove lots of the code ;)

@pfebrer
Copy link
Contributor Author

pfebrer commented Feb 18, 2021

  1. We should probably name it read_charge in that way we can also add Mulliken to the list ;)
  2. We need to add Mulliken

It is more complicated to generalize for Mulliken, because it can be orbital resolved, and the structure of how it's written (even if only atom resolved) is very different! :( Coupling that with the fact that sisl can already generate mulliken from the density matrix, I thought it made sense to only parse hirshfeld and voronoi, which are formatted exactly in the same way. That's also why I called it read_net_charges instead of read_charge, to leave room for a read_mulliken method. Do you think it makes sense to include mulliken under the same method?

Possibly doing something like read_scf with flags to decide what to return (defaults should be similar to that)

Yes, I thought about that too. Although there's a subtle difference. All the information of the scf is always written and in read_scf you choose which one you want to pick. With charges they may or may not be written 😅. Therefore, you might request something that is actually not written in the file.

The code needs to be adapted to handle non-collinear spin ;)

Geez, there's always an edge case for non-collinear spin hahaha. Are voronoi and hirshfeld charges written differently? Or this is just regarding mulliken?

To make things more easy, you could use the function found, line = self.step_to("siesta: Final energy") which jumps in the file.

Nice! :)

@codecov
Copy link

codecov bot commented Feb 18, 2021

Codecov Report

Merging #307 (7597fae) into master (ca79e2e) will increase coverage by 0.02%.
The diff coverage is 88.57%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #307      +/-   ##
==========================================
+ Coverage   86.87%   86.89%   +0.02%     
==========================================
  Files         269      271       +2     
  Lines       38806    39095     +289     
==========================================
+ Hits        33713    33973     +260     
- Misses       5093     5122      +29     
Impacted Files Coverage Δ
sisl/io/siesta/out.py 76.32% <83.69%> (+3.01%) ⬆️
sisl/io/siesta/tests/test_out_charges.py 97.72% <97.72%> (ø)
sisl/__init__.py 97.36% <100.00%> (+0.07%) ⬆️
sisl/_common.py 100.00% <100.00%> (ø)
sisl/tests/test_supercell.py 100.00% <0.00%> (ø)
sisl/linalg/tests/test_solve.py 100.00% <0.00%> (ø)
sisl/tests/test_sparse_orbital.py 100.00% <0.00%> (ø)
sisl/io/siesta/tests/test_siesta.py 100.00% <0.00%> (ø)
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ca79e2e...7597fae. Read the comment docs.

@zerothi
Copy link
Owner

zerothi commented Feb 18, 2021

  1. We should probably name it read_charge in that way we can also add Mulliken to the list ;)
  2. We need to add Mulliken

It is more complicated to generalize for Mulliken, because it can be orbital resolved, and the structure of how it's written (even if only atom resolved) is very different! :( Coupling that with the fact that sisl can already generate mulliken from the density matrix, I thought it made sense to only parse hirshfeld and voronoi, which are formatted exactly in the same way. That's also why I called it read_net_charges instead of read_charge, to leave room for a read_mulliken method. Do you think it makes sense to include mulliken under the same method?

I agree it is much more difficult. But I think it would still fit here. So as of now, you can just stick with Hirshfeld and voronoi, and then we'll add mulliken later.
Yes, I do believe that mulliken would be fitting here :)

Possibly doing something like read_scf with flags to decide what to return (defaults should be similar to that)

Yes, I thought about that too. Although there's a subtle difference. All the information of the scf is always written and in read_scf you choose which one you want to pick. With charges they may or may not be written . Therefore, you might request something that is actually not written in the file.

Agreed, but then errors should be raises. I.e. if they are requested but not existing. In any case, H+V are only written at the end, while Mulliken can be during SCF.
This is more of an API question so users don't have to "guess" / reread manual every time.

The code needs to be adapted to handle non-collinear spin ;)

Geez, there's always an edge case for non-collinear spin hahaha. Are voronoi and hirshfeld charges written differently? Or this is just regarding mulliken?

Hehe, yeah NC changes everything ;)
It seems you have found that 4.1 and master have two different outputs? This is because in master I added NC charges. So you may check that out ;) Just some more columns.

To make things more easy, you could use the function found, line = self.step_to("siesta: Final energy") which jumps in the file.

Nice! :)

Do note that it jumps from the current position. So calling it repeatedly will still work.

@pfebrer
Copy link
Contributor Author

pfebrer commented Feb 18, 2021

Yes, I do believe that mulliken would be fitting here :)

You'd have to include mulliken-only arguments in the method though, wouldn't you?

In any case, H+V are only written at the end, while Mulliken can be during SCF.

Hirshfeld and Voronoi can also be written during scf with PartialChargesAtEverySCFStep true. See the "Charges at every SCF step" section in my initial comment.

Ok, I will try to add arguments to ask for specific charges. What I find most difficult is to distinguish between:

  • User asked for charges at every SCF iteration, there has been only 1 scf iteration.
  • User asked for charges at every MD step.

Maybe it would be useful to actually know what the input of the user was. Would it be reasonable to have a method to get the input info from the output file (as written at the top of the file)?

It seems you have found that 4.1 and master have two different outputs? This is because in master I added NC charges. So you may check that out ;) Just some more columns.

Ok, I will check that.

@zerothi
Copy link
Owner

zerothi commented Feb 18, 2021

Yes, I do believe that mulliken would be fitting here :)

You'd have to include mulliken-only arguments in the method though, wouldn't you?
Hmm.. I don't think so?
We probably we should default to return everything in a PropetryDict

I.e.

ret = PropertyDict()
ret.mulliken = PropertyDict()
ret.mulliken.orbital = ..
ret.mulliken.atom = ..sum of orbital...
ret.hirshfeld = ...
ret.voronoi = ...

hmm... Good question...

In principle the hirshfeld and voronoi could also be done for orbitals. But it isn't implemented...

In any case, H+V are only written at the end, while Mulliken can be during SCF.

Hirshfeld and Voronoi can also be written during scf with PartialChargesAtEverySCFStep true. See the "Charges at every SCF step" section in my initial comment.

Yeah, true.

Ok, I will try to add arguments to ask for specific charges. What I find most difficult is to distinguish between:

  • User asked for charges at every SCF iteration, there has been only 1 scf iteration.
  • User asked for charges at every MD step.

I think the shapes of the returned arrays should adhere to what the user asks for. If scf is requested, then the first dimension should be scf-itt. Much like the read_scf, no?

Maybe it would be useful to actually know what the input of the user was. Would it be reasonable to have a method to get the input info from the output file (as written at the top of the file)?

No, this is opening up a can of worms. Some users pipe in stuff, some put the fdf on the input line.
Also, I wouldn't like to rely too much on having both the out and fdf file present for doing any parsing... :(
Quite often the out file is retained/shared while the fdf are not.

@pfebrer
Copy link
Contributor Author

pfebrer commented Feb 18, 2021

Also, I wouldn't like to rely too much on having both the out and fdf file present for doing any parsing... :(
Quite often the out file is retained/shared while the fdf are not.

Knowing this, maybe it could be interesting to modify SIESTA itself to ensure that all inputs that define a run are logged in the output? 🤔

I see that SIESTA issues the following message:

A complete list of the parameters used, including default values,can be found in file out.fdf

But I don't have an out.fdf file. Anyway, couldn't this all be written in the output?

@pfebrer
Copy link
Contributor Author

pfebrer commented Feb 18, 2021

Ok, current status:

  • works for any kind of spin
  • works for both hirshfeld and voronoi
  • user can request a specific iscf and imd, the default is to return everything that can be found in the file.
  • option to return it as a dataframe. In my opinion, you have to be crazy to not use this option hahah.
  • (hopefully) meaningful errors all around if the user requested something that is not possible.

I hope I made it clear enough how the method behaves in the docstring, can you check it and let me know what do you think? Thanks!

@zerothi
Copy link
Owner

zerothi commented Feb 18, 2021

Also, I wouldn't like to rely too much on having both the out and fdf file present for doing any parsing... :(
Quite often the out file is retained/shared while the fdf are not.

Knowing this, maybe it could be interesting to modify SIESTA itself to ensure that all inputs that define a run are logged in the output?

Hmm. You can open an issue, or talk to Alberto about this... I am not so sure about this...

I see that SIESTA issues the following message:

A complete list of the parameters used, including default values,can be found in file out.fdf

But I don't have an out.fdf file. Anyway, couldn't this all be written in the output?

Hmm, that should probably be deleted. :)
I don't think there is a file that is one-2-one with regards to the input.

@pfebrer
Copy link
Contributor Author

pfebrer commented Feb 21, 2021

I've fixed an error reading the scf charges (I was reading the charges after the scf: 2 .... line, but the charges are actually printed before it).

Also, I read now just the first MD step at the beggining to understand when (if) charges are printed and then proceed to read the whole file from the beggining again. In this way I can distinguish between scf-wise and MD-step-wise charge writing, and also detect if both are turned on.

To find the separation between each MD step, I use self._md_step_final_line, maybe you want to change that (at least change its position in the code). I thought this was better than hard-coding it. In my first refactoring, I added a self.md_steps method that returned an iterator which yielded each MD block, but then I thought you wouldn't probably like it because it uses more memory. However, I think it might be a good idea, even to be used publicly, because it might facilitate custom "dirty" processing directly from users.

@zerothi
Copy link
Owner

zerothi commented Feb 26, 2021

Did you prepare some out files containing both the old and new format? If so, could you mail them to me?

@zerothi
Copy link
Owner

zerothi commented Feb 26, 2021

ps working on some other edits, please hold ;)

@pfebrer
Copy link
Contributor Author

pfebrer commented Feb 26, 2021

different_charges_outputs.zip

Here's a bunch of different outputs I was using for testing. The ones that end with "4.1" are the old format

Also added tests.

To accommodate details of the input, we now have an enum
that contains *options*.

Opt.ANY can thus be used as arguments.
This needs to be put in read_scf as well.
@zerothi
Copy link
Owner

zerothi commented Mar 5, 2021

@pfebrer could you please have a look at these changes.

Basically, I re-wrote everything.
This made me realize a few things that make it difficult to know what a users requests.

The solution is that now there exists an "option" enum object which holds

Opt.ANY
Opt.ALL
Opt.NONE

they can be bit-wise coupled and allows one to distinguish between different quantities.
Now the default is Opt.ANY for both imd and iscf which basically requests what ever is there, defaulting to the most information.

I also fixed the dataframe handling and added the tests you provided.

Is this fine for you?

Copy link
Contributor Author

@pfebrer pfebrer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tested the dataframes? I can't use it, I get the error:

Invalid to pass a non-int64 dtype to RangeIndex


# first line is the header
header = (self.readline()
.replace("Qatom", "q") # Qatom in 4.1, dQatom in master
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this result in "q" for 4.1 and "dq" for master? I thought it was better to unify them with the same names. Otherwise sisl code needs to be different depending on the version of SIESTA that generated the log :(

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, I missed this, will fix this.

The point was that that column refers to dq, so I want the column to be named dq, and the population to be named e.

@zerothi
Copy link
Owner

zerothi commented Mar 5, 2021

Have you tested the dataframes? I can't use it, I get the error:

Invalid to pass a non-int64 dtype to RangeIndex

Which pandas version are you using? Yes, I did test it ;)

@zerothi
Copy link
Owner

zerothi commented Mar 5, 2021

Could you have a go now? Thanks!

@pfebrer
Copy link
Contributor Author

pfebrer commented Mar 5, 2021

Ok, I realised I may have a too old pandas (0.25.3), although it's been only about a year and a half since it was released 😅

Now it works even with my pandas version.

One thing that I miss with respect to how it was before is that imd=-1, iscf=-1 are not valid if there is no MD or iscf information. I thought that was quite useful, because then:

  • imd = -1, iscf=-1 gives you the final charges regardless of whether the charges are scf/MD resolved or not
  • iscf = -1 gives you the last charges of each MD step whether there is scf resolution or just per MD step resolution.

Don't you think this makes sense?

@zerothi
Copy link
Owner

zerothi commented Mar 5, 2021

Now it works even with my pandas version.
Ok, I also changed something there, so could possibly work with the older version.

One thing that I miss with respect to how it was before is that imd=-1, iscf=-1 are not valid if there is no MD or iscf information. I thought that was quite useful, because then:

  • imd = -1, iscf=-1 gives you the final charges regardless of whether the charges are scf/MD resolved or not
  • iscf = -1 gives you the last charges of each MD step whether there is scf resolution or just per MD step resolution.

Don't you think this makes sense?

While that may in some cases be the case, I don't think it is a good idea. The charges could potentially change between the SCF and MD charges due to some mixing schemes etc.
With this in mind I think it is better to be safe than sorry. Of course the current Opt.ANY is rather vague since the returned data can be 1 of three variants. Quite annoying. I don't see a good solution to this. Ideally Siesta will eventually write usable data in a stricter format (yaml/json or whatever).

@pfebrer
Copy link
Contributor Author

pfebrer commented Mar 5, 2021

How can I ask for the final charges now then?

@zerothi
Copy link
Owner

zerothi commented Mar 5, 2021

Yeah, I probably need to go through the documentation again. iscf=None, imd=None. That specifically asks for not md and not scf.

@zerothi
Copy link
Owner

zerothi commented Mar 8, 2021

Could you have a look at the documentation now? I think it is ready?

@pfebrer
Copy link
Contributor Author

pfebrer commented Mar 8, 2021

Great, I also think this is good to go!

@zerothi zerothi merged commit ba2a02a into zerothi:master Mar 8, 2021
@pfebrer pfebrer deleted the read_net_charges branch March 31, 2021 18:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants