Reading net charges from outSileSiesta #307

pfebrer · 2021-02-18T10:07:07Z

Instead of generating Voronoi charges as in my question 2 days ago I decided to let SIESTA do it. I then implemented a method to parse the charges from siesta's output. Is it useful enough to go in?

If you agree this could be useful, we can discuss the API. Right now it doesn't let the user decide which charges they want, it will return different things depending on how often charges have been written. You can find in this zip different output files of a 3 step MD of graphene, and following you can see how the method behaves for the different situations.

Charges at the end of the calculation

sisl.get_sile("charges_final.out").read_net_charges("voronoi")

array([-0.00598,  0.00596])

sisl.get_sile("charges_final.out").read_net_charges("hirshfeld")

array([-0.e+00, -2.e-05])

Charges at every MD step

sisl.get_sile("charges_MD_step.out").read_net_charges("voronoi")

[array([[-0.02935,  0.02934]]),
 array([[-0.01198,  0.01196]]),
 array([[-0.00598,  0.00596]])]

sisl.get_sile("charges_MD_step.out").read_net_charges("hirshfeld")

[array([[-7.e-05,  5.e-05]]),
 array([[-1.e-05, -1.e-05]]),
 array([[-0.e+00, -2.e-05]])]

Charges at every SCF step

sisl.get_sile("charges_scf_step.out").read_net_charges("voronoi")

[array([[-0.03033,  0.03032],
        [-0.02957,  0.02955],
        [-0.02954,  0.02952],
        [-0.02939,  0.02938],
        [-0.02937,  0.02936],
        [-0.02935,  0.02934],
        [-0.02935,  0.02934],
        [-0.02935,  0.02934]]), array([[-0.01225,  0.01223],
        [-0.01199,  0.01197],
        [-0.01198,  0.01196],
        [-0.01197,  0.01195],
        [-0.01198,  0.01196],
        [-0.01198,  0.01196]]), array([[-0.00581,  0.00579],
        [-0.00601,  0.00598],
        [-0.00599,  0.00597],
        [-0.00598,  0.00595],
        [-0.00598,  0.00596],
        [-0.00598,  0.00596]])]

sisl.get_sile("charges_scf_step.out").read_net_charges("hirshfeld")

[array([[-9.e-05,  8.e-05],
        [-5.e-05,  3.e-05],
        [-6.e-05,  5.e-05],
        [-8.e-05,  6.e-05],
        [-7.e-05,  6.e-05],
        [-7.e-05,  5.e-05],
        [-7.e-05,  5.e-05],
        [-7.e-05,  5.e-05]]), array([[-6.e-05,  4.e-05],
        [-2.e-05,  0.e+00],
        [-2.e-05, -0.e+00],
        [-1.e-05, -1.e-05],
        [-1.e-05, -1.e-05],
        [-1.e-05, -1.e-05]]), array([[-7.e-05,  4.e-05],
        [-2.e-05, -1.e-05],
        [-1.e-05, -2.e-05],
        [ 0.e+00, -3.e-05],
        [-0.e+00, -2.e-05],
        [-0.e+00, -2.e-05]])]

Cheers!

zerothi · 2021-02-18T10:16:55Z

Great, good idea!

We should probably name it read_charge in that way we can also add Mulliken to the list ;)
We need to add Mulliken
Possibly doing something like read_scf with flags to decide what to return (defaults should be similar to that)
The code needs to be adapted to handle non-collinear spin ;)

To make things more easy, you could use the function found, line = self.step_to("siesta: Final energy") which jumps in the file.
This should remove lots of the code ;)

pfebrer · 2021-02-18T10:30:06Z

We should probably name it read_charge in that way we can also add Mulliken to the list ;)

We need to add Mulliken

It is more complicated to generalize for Mulliken, because it can be orbital resolved, and the structure of how it's written (even if only atom resolved) is very different! :( Coupling that with the fact that sisl can already generate mulliken from the density matrix, I thought it made sense to only parse hirshfeld and voronoi, which are formatted exactly in the same way. That's also why I called it read_net_charges instead of read_charge, to leave room for a read_mulliken method. Do you think it makes sense to include mulliken under the same method?

Possibly doing something like read_scf with flags to decide what to return (defaults should be similar to that)

Yes, I thought about that too. Although there's a subtle difference. All the information of the scf is always written and in read_scf you choose which one you want to pick. With charges they may or may not be written 😅. Therefore, you might request something that is actually not written in the file.

The code needs to be adapted to handle non-collinear spin ;)

Geez, there's always an edge case for non-collinear spin hahaha. Are voronoi and hirshfeld charges written differently? Or this is just regarding mulliken?

To make things more easy, you could use the function found, line = self.step_to("siesta: Final energy") which jumps in the file.

Nice! :)

codecov · 2021-02-18T10:31:20Z

Codecov Report

Merging #307 (7597fae) into master (ca79e2e) will increase coverage by 0.02%.
The diff coverage is 88.57%.

@@            Coverage Diff             @@
##           master     #307      +/-   ##
==========================================
+ Coverage   86.87%   86.89%   +0.02%     
==========================================
  Files         269      271       +2     
  Lines       38806    39095     +289     
==========================================
+ Hits        33713    33973     +260     
- Misses       5093     5122      +29

Impacted Files	Coverage Δ
sisl/io/siesta/out.py	`76.32% <83.69%> (+3.01%)`	⬆️
sisl/io/siesta/tests/test_out_charges.py	`97.72% <97.72%> (ø)`
sisl/__init__.py	`97.36% <100.00%> (+0.07%)`	⬆️
sisl/_common.py	`100.00% <100.00%> (ø)`
sisl/tests/test_supercell.py	`100.00% <0.00%> (ø)`
sisl/linalg/tests/test_solve.py	`100.00% <0.00%> (ø)`
sisl/tests/test_sparse_orbital.py	`100.00% <0.00%> (ø)`
sisl/io/siesta/tests/test_siesta.py	`100.00% <0.00%> (ø)`
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ca79e2e...7597fae. Read the comment docs.

zerothi · 2021-02-18T10:46:49Z

We should probably name it read_charge in that way we can also add Mulliken to the list ;)

We need to add Mulliken

It is more complicated to generalize for Mulliken, because it can be orbital resolved, and the structure of how it's written (even if only atom resolved) is very different! :( Coupling that with the fact that sisl can already generate mulliken from the density matrix, I thought it made sense to only parse hirshfeld and voronoi, which are formatted exactly in the same way. That's also why I called it read_net_charges instead of read_charge, to leave room for a read_mulliken method. Do you think it makes sense to include mulliken under the same method?

I agree it is much more difficult. But I think it would still fit here. So as of now, you can just stick with Hirshfeld and voronoi, and then we'll add mulliken later.
Yes, I do believe that mulliken would be fitting here :)

Possibly doing something like read_scf with flags to decide what to return (defaults should be similar to that)

Yes, I thought about that too. Although there's a subtle difference. All the information of the scf is always written and in read_scf you choose which one you want to pick. With charges they may or may not be written . Therefore, you might request something that is actually not written in the file.

Agreed, but then errors should be raises. I.e. if they are requested but not existing. In any case, H+V are only written at the end, while Mulliken can be during SCF.
This is more of an API question so users don't have to "guess" / reread manual every time.

The code needs to be adapted to handle non-collinear spin ;)

Geez, there's always an edge case for non-collinear spin hahaha. Are voronoi and hirshfeld charges written differently? Or this is just regarding mulliken?

Hehe, yeah NC changes everything ;)
It seems you have found that 4.1 and master have two different outputs? This is because in master I added NC charges. So you may check that out ;) Just some more columns.

To make things more easy, you could use the function found, line = self.step_to("siesta: Final energy") which jumps in the file.

Nice! :)

Do note that it jumps from the current position. So calling it repeatedly will still work.

pfebrer · 2021-02-18T11:03:16Z

Yes, I do believe that mulliken would be fitting here :)

You'd have to include mulliken-only arguments in the method though, wouldn't you?

In any case, H+V are only written at the end, while Mulliken can be during SCF.

Hirshfeld and Voronoi can also be written during scf with PartialChargesAtEverySCFStep true. See the "Charges at every SCF step" section in my initial comment.

Ok, I will try to add arguments to ask for specific charges. What I find most difficult is to distinguish between:

User asked for charges at every SCF iteration, there has been only 1 scf iteration.
User asked for charges at every MD step.

Maybe it would be useful to actually know what the input of the user was. Would it be reasonable to have a method to get the input info from the output file (as written at the top of the file)?

It seems you have found that 4.1 and master have two different outputs? This is because in master I added NC charges. So you may check that out ;) Just some more columns.

Ok, I will check that.

zerothi · 2021-02-18T12:08:10Z

Yes, I do believe that mulliken would be fitting here :)

You'd have to include mulliken-only arguments in the method though, wouldn't you?
Hmm.. I don't think so?
We probably we should default to return everything in a PropetryDict

I.e.

ret = PropertyDict()
ret.mulliken = PropertyDict()
ret.mulliken.orbital = ..
ret.mulliken.atom = ..sum of orbital...
ret.hirshfeld = ...
ret.voronoi = ...

hmm... Good question...

In principle the hirshfeld and voronoi could also be done for orbitals. But it isn't implemented...

In any case, H+V are only written at the end, while Mulliken can be during SCF.

Hirshfeld and Voronoi can also be written during scf with PartialChargesAtEverySCFStep true. See the "Charges at every SCF step" section in my initial comment.

Yeah, true.

Ok, I will try to add arguments to ask for specific charges. What I find most difficult is to distinguish between:

User asked for charges at every SCF iteration, there has been only 1 scf iteration.

User asked for charges at every MD step.

I think the shapes of the returned arrays should adhere to what the user asks for. If scf is requested, then the first dimension should be scf-itt. Much like the read_scf, no?

Maybe it would be useful to actually know what the input of the user was. Would it be reasonable to have a method to get the input info from the output file (as written at the top of the file)?

No, this is opening up a can of worms. Some users pipe in stuff, some put the fdf on the input line.
Also, I wouldn't like to rely too much on having both the out and fdf file present for doing any parsing... :(
Quite often the out file is retained/shared while the fdf are not.

pfebrer · 2021-02-18T15:26:07Z

Also, I wouldn't like to rely too much on having both the out and fdf file present for doing any parsing... :(
Quite often the out file is retained/shared while the fdf are not.

Knowing this, maybe it could be interesting to modify SIESTA itself to ensure that all inputs that define a run are logged in the output? 🤔

I see that SIESTA issues the following message:

A complete list of the parameters used, including default values,can be found in file out.fdf

But I don't have an out.fdf file. Anyway, couldn't this all be written in the output?

pfebrer · 2021-02-18T20:02:06Z

Ok, current status:

works for any kind of spin
works for both hirshfeld and voronoi
user can request a specific iscf and imd, the default is to return everything that can be found in the file.
option to return it as a dataframe. In my opinion, you have to be crazy to not use this option hahah.
(hopefully) meaningful errors all around if the user requested something that is not possible.

I hope I made it clear enough how the method behaves in the docstring, can you check it and let me know what do you think? Thanks!

zerothi · 2021-02-18T20:05:49Z

Also, I wouldn't like to rely too much on having both the out and fdf file present for doing any parsing... :(
Quite often the out file is retained/shared while the fdf are not.

Knowing this, maybe it could be interesting to modify SIESTA itself to ensure that all inputs that define a run are logged in the output?

Hmm. You can open an issue, or talk to Alberto about this... I am not so sure about this...

I see that SIESTA issues the following message:

A complete list of the parameters used, including default values,can be found in file out.fdf

But I don't have an out.fdf file. Anyway, couldn't this all be written in the output?

Hmm, that should probably be deleted. :)
I don't think there is a file that is one-2-one with regards to the input.

pfebrer · 2021-02-21T21:51:28Z

I've fixed an error reading the scf charges (I was reading the charges after the scf: 2 .... line, but the charges are actually printed before it).

Also, I read now just the first MD step at the beggining to understand when (if) charges are printed and then proceed to read the whole file from the beggining again. In this way I can distinguish between scf-wise and MD-step-wise charge writing, and also detect if both are turned on.

To find the separation between each MD step, I use self._md_step_final_line, maybe you want to change that (at least change its position in the code). I thought this was better than hard-coding it. In my first refactoring, I added a self.md_steps method that returned an iterator which yielded each MD block, but then I thought you wouldn't probably like it because it uses more memory. However, I think it might be a good idea, even to be used publicly, because it might facilitate custom "dirty" processing directly from users.

zerothi · 2021-02-26T14:20:16Z

Did you prepare some out files containing both the old and new format? If so, could you mail them to me?

zerothi · 2021-02-26T14:44:02Z

ps working on some other edits, please hold ;)

pfebrer · 2021-02-26T15:08:43Z

different_charges_outputs.zip

Here's a bunch of different outputs I was using for testing. The ones that end with "4.1" are the old format

Also added tests. To accommodate details of the input, we now have an enum that contains *options*. Opt.ANY can thus be used as arguments. This needs to be put in read_scf as well.

zerothi · 2021-03-05T10:36:07Z

@pfebrer could you please have a look at these changes.

Basically, I re-wrote everything.
This made me realize a few things that make it difficult to know what a users requests.

The solution is that now there exists an "option" enum object which holds

Opt.ANY
Opt.ALL
Opt.NONE

they can be bit-wise coupled and allows one to distinguish between different quantities.
Now the default is Opt.ANY for both imd and iscf which basically requests what ever is there, defaulting to the most information.

I also fixed the dataframe handling and added the tests you provided.

Is this fine for you?

pfebrer

Have you tested the dataframes? I can't use it, I get the error:

Invalid to pass a non-int64 dtype to RangeIndex

pfebrer · 2021-03-05T12:41:23Z

sisl/io/siesta/out.py

+
+            # first line is the header
+            header = (self.readline()
+                      .replace("Qatom", "q") # Qatom in 4.1, dQatom in master


Does this result in "q" for 4.1 and "dq" for master? I thought it was better to unify them with the same names. Otherwise sisl code needs to be different depending on the version of SIESTA that generated the log :(

True, I missed this, will fix this.

The point was that that column refers to dq, so I want the column to be named dq, and the population to be named e.

zerothi · 2021-03-05T13:37:35Z

Have you tested the dataframes? I can't use it, I get the error:
Invalid to pass a non-int64 dtype to RangeIndex

Which pandas version are you using? Yes, I did test it ;)

zerothi · 2021-03-05T13:46:49Z

Could you have a go now? Thanks!

pfebrer · 2021-03-05T16:21:07Z

Ok, I realised I may have a too old pandas (0.25.3), although it's been only about a year and a half since it was released 😅

Now it works even with my pandas version.

One thing that I miss with respect to how it was before is that imd=-1, iscf=-1 are not valid if there is no MD or iscf information. I thought that was quite useful, because then:

imd = -1, iscf=-1 gives you the final charges regardless of whether the charges are scf/MD resolved or not
iscf = -1 gives you the last charges of each MD step whether there is scf resolution or just per MD step resolution.

Don't you think this makes sense?

zerothi · 2021-03-05T19:53:58Z

Now it works even with my pandas version.
Ok, I also changed something there, so could possibly work with the older version.

One thing that I miss with respect to how it was before is that imd=-1, iscf=-1 are not valid if there is no MD or iscf information. I thought that was quite useful, because then:

imd = -1, iscf=-1 gives you the final charges regardless of whether the charges are scf/MD resolved or not

iscf = -1 gives you the last charges of each MD step whether there is scf resolution or just per MD step resolution.

Don't you think this makes sense?

While that may in some cases be the case, I don't think it is a good idea. The charges could potentially change between the SCF and MD charges due to some mixing schemes etc.
With this in mind I think it is better to be safe than sorry. Of course the current Opt.ANY is rather vague since the returned data can be 1 of three variants. Quite annoying. I don't see a good solution to this. Ideally Siesta will eventually write usable data in a stricter format (yaml/json or whatever).

pfebrer · 2021-03-05T20:20:47Z

How can I ask for the final charges now then?

zerothi · 2021-03-05T20:29:52Z

Yeah, I probably need to go through the documentation again. iscf=None, imd=None. That specifically asks for not md and not scf.

zerothi · 2021-03-08T11:53:41Z

Could you have a look at the documentation now? I think it is ready?

pfebrer · 2021-03-08T12:26:52Z

Great, I also think this is good to go!

pfebrer force-pushed the read_net_charges branch from 145ed0e to b0da498 Compare February 21, 2021 21:42

pfebrer force-pushed the read_net_charges branch from b0da498 to c1addeb Compare February 21, 2021 21:53

pfebrer and others added 4 commits February 26, 2021 15:11

enh: read net charges outSileSiesta

479e369

enh: outSileSiesta read charge spin and dataframe

c94f629

maint: refactoring outSileSiesta read_charge

6d942e1

mnt: restructured for step_to

6096b4a

zerothi force-pushed the read_net_charges branch from c1addeb to 6096b4a Compare February 26, 2021 14:13

mnt: change read_charge for more versatility and clarity

f51133f

Also added tests. To accommodate details of the input, we now have an enum that contains *options*. Opt.ANY can thus be used as arguments. This needs to be put in read_scf as well.

pfebrer commented Mar 5, 2021

View reviewed changes

mnt: fixed rangeindex and 4.1 vs master

36847aa

doc: updated documentation

7597fae

zerothi merged commit ba2a02a into zerothi:master Mar 8, 2021

pfebrer deleted the read_net_charges branch March 31, 2021 18:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reading net charges from outSileSiesta #307

Reading net charges from outSileSiesta #307

pfebrer commented Feb 18, 2021 •

edited

Loading

zerothi commented Feb 18, 2021

pfebrer commented Feb 18, 2021

codecov bot commented Feb 18, 2021 •

edited

Loading

zerothi commented Feb 18, 2021

pfebrer commented Feb 18, 2021 •

edited

Loading

zerothi commented Feb 18, 2021

pfebrer commented Feb 18, 2021

pfebrer commented Feb 18, 2021

zerothi commented Feb 18, 2021

pfebrer commented Feb 21, 2021 •

edited

Loading

zerothi commented Feb 26, 2021

zerothi commented Feb 26, 2021

pfebrer commented Feb 26, 2021

zerothi commented Mar 5, 2021

pfebrer left a comment •

edited by zerothi

Loading

pfebrer Mar 5, 2021

zerothi Mar 5, 2021

zerothi commented Mar 5, 2021

zerothi commented Mar 5, 2021

pfebrer commented Mar 5, 2021

zerothi commented Mar 5, 2021

pfebrer commented Mar 5, 2021

zerothi commented Mar 5, 2021

zerothi commented Mar 8, 2021

pfebrer commented Mar 8, 2021

Reading net charges from outSileSiesta #307

Reading net charges from outSileSiesta #307

Conversation

pfebrer commented Feb 18, 2021 • edited Loading

Charges at the end of the calculation

Charges at every MD step

Charges at every SCF step

zerothi commented Feb 18, 2021

pfebrer commented Feb 18, 2021

codecov bot commented Feb 18, 2021 • edited Loading

Codecov Report

zerothi commented Feb 18, 2021

pfebrer commented Feb 18, 2021 • edited Loading

zerothi commented Feb 18, 2021

pfebrer commented Feb 18, 2021

pfebrer commented Feb 18, 2021

zerothi commented Feb 18, 2021

pfebrer commented Feb 21, 2021 • edited Loading

zerothi commented Feb 26, 2021

zerothi commented Feb 26, 2021

pfebrer commented Feb 26, 2021

zerothi commented Mar 5, 2021

pfebrer left a comment • edited by zerothi Loading

Choose a reason for hiding this comment

pfebrer Mar 5, 2021

Choose a reason for hiding this comment

zerothi Mar 5, 2021

Choose a reason for hiding this comment

zerothi commented Mar 5, 2021

zerothi commented Mar 5, 2021

pfebrer commented Mar 5, 2021

zerothi commented Mar 5, 2021

pfebrer commented Mar 5, 2021

zerothi commented Mar 5, 2021

zerothi commented Mar 8, 2021

pfebrer commented Mar 8, 2021

pfebrer commented Feb 18, 2021 •

edited

Loading

codecov bot commented Feb 18, 2021 •

edited

Loading

pfebrer commented Feb 18, 2021 •

edited

Loading

pfebrer commented Feb 21, 2021 •

edited

Loading

pfebrer left a comment •

edited by zerothi

Loading