Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using sisl to parse aiida-siesta output #484

Closed
ahkole opened this issue Sep 2, 2022 · 38 comments · Fixed by #494
Closed

Using sisl to parse aiida-siesta output #484

ahkole opened this issue Sep 2, 2022 · 38 comments · Fixed by #494
Milestone

Comments

@ahkole
Copy link
Contributor

ahkole commented Sep 2, 2022

Describe the feature
I am trying out AiiDA to manage my workflow with siesta. I really like what the aiida-siesta (https://docs.siesta-project.org/projects/aiida-siesta/en/latest/) has already implemented in terms of calculations and workflows. This plugin also has a simple parser for parsing some of the basic siesta output files.I would like to extend the functionality to also be able to parse information from other siesta output files. Sisl has a large collection of parsers available so it would be great if I could use these to parse the output files produced by siesta. However, I haven't found a good robust way yet to use sisl for this. AiiDA stores retrieved output files from simulations in a file repository and gives you access to these files through a FolderData API (https://aiida.readthedocs.io/projects/aiida-core/en/stable/topics/data_types.html#topics-data-types-core-folder). This API is able to provide a file handle object for reading or can simply read and return the entire content of the file. On the other hand, to use sisl for parsing a file you need to supply the filename of the file and sisl handles the rest. Unfortunately I don't have access to the filename, just to either a file handle or the complete file content. Is there a way to make sisl work with AiiDA retrieved files? So far I have come up with a couple of potential solutions:

  1. Write the contents of the AiiDA retrieved file to a temporary file in the current working directory with the right extension, use get_sile to parse this temporary file and then remove it. This should work out of the box but involves a lot of unnecessary IO. This also probably becomes unnecessarily slow if you have large output files.
  2. Modify sisl to not only accept filenames but also the entire file contents as a string (or something equivalent for binary files). You could then maybe use the StringIO and BytesIO classes to provide a file like interface to the other methods in the sile. The big disadvantage from this is that you cannot derive the filetype from the extension so you would have to supply it some other way. And this may become slow and require tons of memory (because the entire file needs to be loaded into memory) if you have large files.
  3. Modify sisl to accept a file handle instead of a filepath. This to me seems like the nicest option. You still need some way to determine the file type (i.e. you could give a filepath with extension in addition to the file handle (the name of the original file is still available from within AiiDA so this would be easy) or there could be some other input parameter for this).

What do you think? Is there a way that sisl could be used for parsing files retrieved by AiiDA? And if so, do any of the above solutions look like a good one? Or do you have a better proposal? I would also be happy to help with any implementation and/or testing once we have settled on an approach.

@zerothi
Copy link
Owner

zerothi commented Sep 8, 2022

Thanks for your request!

Indeed this would be nice to have.
@pfebrer might be interested in this discussion as he has requested something similar.

I think one way to easily do this is to add a __new__ method which bypasses the __enter__ and __exit__ methods and fixes the internal file-handle. Additionally the ascii siles should probably inherit from TextIOBase to make them work, but lets see about that.

I think your option 3 would be the best direction.

@pfebrer I just thought about this way of achieving this, what do you think?

To test this, I would just inherit one class and add the __new__ method and see how it works, that should make it fast to debug and implement. Let me know if this could be a sustainable work-around?

@pfebrer
Copy link
Contributor

pfebrer commented Sep 8, 2022

Yes, we already discussed something like this, and Jonas proposed a workaround, here: #205 (comment).

Recently I also proposed that it would be nice to be able to avoid the Sile framework altogether and separate parsing functions from the actual sile class so that they can be reused in any situation, here: #483 (comment).

In the case of not being able to avoid using siles, probably something like sisl.get_sile(io_stream, cls=sisl.io.whateverSile) would be my preferred interface, where io_stream is a file handle, StringIO, BytesIO or any other stream that can be read.

StringIO and BytesIO are probably more difficult to implement, but with file handles it should be fairly easy, no? Instead of __new__, maybe we could add the new method, as in Geometry.new? Messing with object initialization can lead to obscure behavior if we keep making it more complex in the future.

@ahkole
Copy link
Contributor Author

ahkole commented Sep 9, 2022

I have been playing around a bit with this and I have something which sort of works for the ASCII siles. The way I have implemented it now is by modifying the constructor of the Sile class to look for a file_handle argument in the kwargs. If it finds this, it sets an attribute self._ext_fh to point to this external file-handle. Furthermore, I made a small modification to the _open function to first check if the _ext_fh attribute exists and if it does, it uses this external file-handle instead of opening a file based on the filename provided.

While working on this I ran into the issue that some of the read_something routines use the sile object as an iterator. Once the read routine exits, the iterator is garbage collected and this calls the close() function on the file-handle making it unusable for further reads. My current workaround for this is to copy the entire contents of the external file-handle into a StringIO object (so self.fh is a StringIO copy of the contents of _ext_fh. In this way, when the reading routine exists and self.fh gets closed automatically by the garbage collection of the iterator the _ext_fh is still open and can be used to create another self.fh for further reads. This is a very hacky workaround I admit (and it requires you to duplicate the contents of the file-handle into memory), but I don't currently see any other way of fixing this besides rewriting all the read routines to not use the sile object as an iterator. Or am I missing something?

The usage of the iterator for some of the reading routines also leads to problems when using the sile inside a with context (i.e. with sisl.get_sile('somefile', 'r') as sile:. Because the sile_fh_open decorator does not open a new file-handle when the self.fh attribute already exists, you will get an error when trying to do another read operation after a read that uses iterators. For example, I got errors with the following:

with sisl.get_sile('siesta.out', 'r') as sile:
    Es = sile.read_energy()
    scf_info = sile.read_scf()

Because read_energy uses an iterator, the file handle self.fh is already closed before entering read_scf and we get an error. I tried to fix this by adding a check in sile_fh_open for the case where the self.fh attribute exists to see if the file handle is still open and if it is not open call self._open() to open it again. Do you agree with this solution or is there a better way to fix this?

Finally, another small bug I discovered is that read_energy from the outSileSiesta might fail if it is called after some other read operations. I suspect this is due to the fact that this reading routine uses step_to with reread=False so that if the stream position is after the energy block in the output file then it cannot find the energies. This could be fixed either by setting reread=True or using sile_fh_open(True) above the read_energy routine.

If you want to play around with it yourself, my current implementation is here https://github.com/ahkole/sisl/tree/sile-from-file-handle .

@pfebrer
Copy link
Contributor

pfebrer commented Sep 10, 2022

If you added the _ext_fh attribute, perhaps it's better to just add a check here:

sisl/sisl/io/sile.py

Lines 472 to 476 in 3f11290

def __getattr__(self, name):
""" Override to check the handle """
if name == 'fh':
raise AttributeError(f"The filehandle for {self.file} has not been opened yet...")
return getattr(self.fh, name)

If name == 'fh' it means a self.fh section of the code was reached but there is no fh attribute, so perhaps there we could check if there is this _ext_fh. Then everything that is applied to fh would be applied to _ext_fh as well, which makes sense to me.

@pfebrer
Copy link
Contributor

pfebrer commented Sep 10, 2022

Perhaps you would need to modify the hasattr(self, 'fh') section here:

sisl/sisl/io/sile.py

Lines 516 to 544 in 3f11290

def sile_fh_open(from_closed=False):
""" Method decorator for objects to directly implement opening of the
file-handle upon entry (if it isn't already).
Parameters
----------
from_closed : bool, optional
ensure the wrapped function *must* open the file, otherwise it will seek to 0.
"""
if from_closed:
def _wrapper(func):
@wraps(func)
def pre_open(self, *args, **kwargs):
if hasattr(self, "fh"):
self.fh.seek(0)
return func(self, *args, **kwargs)
with self:
return func(self, *args, **kwargs)
return pre_open
else:
def _wrapper(func):
@wraps(func)
def pre_open(self, *args, **kwargs):
if hasattr(self, "fh"):
return func(self, *args, **kwargs)
with self:
return func(self, *args, **kwargs)
return pre_open
return _wrapper

I'm not sure.

@ahkole
Copy link
Contributor Author

ahkole commented Sep 13, 2022

If name == 'fh' it means a self.fh section of the code was reached but there is no fh attribute, so perhaps there we could check if there is this _ext_fh. Then everything that is applied to fh would be applied to _ext_fh as well, which makes sense to me.

That could be an alternative to copying the file handle in _ext_fh to self.fh yes. We would still need to need to prevent the sile from trying to open a file handle using the self.file property (because this file does not exists). And we might be able to do this by modifying the sile_fh_open decorator as you suggest. I think this would also work yes. Do you think this is better than copying _ext_fh to self.fh?

Also, this does not fix the issue that I raised where the file handle gets closed when you exit an iterator loop in some of the read routines (at least I suspect that is the reason the file handle gets closed).

@pfebrer
Copy link
Contributor

pfebrer commented Sep 15, 2022

I'm not sure which of the approaches is better because I don't have the whole picture in my mind, I think Nick is the only one capable of answering that question :)

Anyway I have already said in some discussions, and this issue makes me yet be more confident about it, that the best solution would be to write read_* and write_* as functions (not methods). These functions could accept streams (file_handle, StringIO, etc...) and read from them. Then siles would just use them for their particular use case. Forcing external file handles into siles in my opinion makes the code unnnecessarily complex and too entangled to easily add new features in the future.

@zerothi
Copy link
Owner

zerothi commented Sep 28, 2022

Thanks for your work on this.

  1. We definitely need a solution to this.
  2. What I am worried about in the long-term for the read_*/write_* methods is the shear amount of unique function names we need. Asking users to search in them might be too much? Alternatively we should test this with say only the geometry and do as ase does it, a single read_geometry function that can accept, Union[str, Path, StringIO, FileHandle] and then redirect where needed. This would still be a bit problematic for some classes since what you really want is to disect the classes functionality. This also leads to other problems in files that contain keyworded elements, say all input files like fdf, xsf and pdb files. In those files the order of elements may be arbitrary and it depends on which position of the file-handle you pass to the read function. So what should we do with those files? (agreed, there aren't a lot of them, but there are some, and some of them are quite useful in MD simulations from various codes output). And should one be able to backtrack a file-handle?

Why is it difficult to add new features with the classes? I mean, if you want to read an xyz some place in the middle, and something from a cube file, I imagine something like this:

fh = ...
geom = xyzSile(fh).read_geometry() # this will step the file-handle
grid = cubeSile(fh).read_grid() # this will also step the file-handle

@zerothi
Copy link
Owner

zerothi commented Sep 28, 2022

For instance, I can get a subclassed element to work like this:

class fhxyz(xyzSile):

    def __init__(self, filehandle, comment=None, *args, **kwargs):
        self._file = Path(filehandle.name)
        self._mode = filehandle.mode
        if isinstance(comment, (list, tuple)):
            self._comment = list(comment)
        elif not comment is None:
            self._comment = [comment]
        else:
            self._comment = []
        self._line = 0
        self.fh = filehandle

    def _open(self):
        pass

    def __enter__(self):
        return self

    def __exit__(self, type, value, traceback):
        # we will not close a file-handle
        return False

This could be part of the __new__ function to change it at creation.

@pfebrer
Copy link
Contributor

pfebrer commented Sep 28, 2022

I'm not saying it would be hard to use, of course the API can be very simple. I'm more worried about the internal complexity that it adds unnecessarily.

I agree that it might require some work at some points where the reading functionality is bound to the class. But is it that much? If you were to build it from scratch, how would you do it? What will be easier to extend in the future?

The added internal complexity also leads to obscure behaviors like the one @ahkole was pointing, which you can't avoid unless you modify or subclass library code.

This also leads to other problems in files that contain keyworded elements, say all input files like fdf, xsf and pdb files. In those files the order of elements may be arbitrary and it depends on which position of the file-handle you pass to the read function

I don't see a problem with that. The user should know the position of their file handle, if it matters.

And should one be able to backtrack a file-handle?

I think the bare reading and writing functions should not be able to backtrack a handle unless it is explicitly specified by the user in cases where it can make sense.

@zerothi
Copy link
Owner

zerothi commented Sep 29, 2022

I'm not saying it would be hard to use, of course the API can be very simple. I'm more worried about the internal complexity that it adds unnecessarily.

Agreed, this is a bit worrying, however, the idea would be that this small complexity should be stable, i.e. the above class structure is a bit complex, but very small, and hence manageable. It shouldn't be changed everywhere, only in the __new__ of Sile, no where else?

I agree that it might require some work at some points where the reading functionality is bound to the class. But is it that much? If you were to build it from scratch, how would you do it? What will be easier to extend in the future?

I don't see a huge benefit for functions over classes, I mean that many of these siles are rather simple, some are complex, but those are only the keyworded files (fdf, etc.).
Like I said, I am not against it, I just think it is a major effort as it is. I would be happy to review a PR with a suggestion for one of the files, say xyz as you had use for that.
Do note that the functionality provided by the class above should do exactly the same, no?

The added internal complexity also leads to obscure behaviors like the one @ahkole was pointing, which you can't avoid unless you modify or subclass library code.

My point would be that siles should generally be agnostic when it comes to file-handles or file-names, i.e. all it does is read and parse content in some buffer. So the classes should be the same, regardless.

This also leads to other problems in files that contain keyworded elements, say all input files like fdf, xsf and pdb files. In those files the order of elements may be arbitrary and it depends on which position of the file-handle you pass to the read function

I don't see a problem with that. The user should know the position of their file handle, if it matters.

But it matters greatly, you can't say,

  1. read the supercell
  2. then read the geometry

Because then you have passed the supercell position. My experience is that this will be confusing to end-users.
Actually my current solution would enable this as it would enable re-reading from where the file-handle was initially.

To add to my point about users, @ahkole's bug on the read_energy in the outsile is actually precisely this problem. The out sile is state aware and will continue read where it is, and thus should never backtrack. Would @ahkole expect that it couldn't find the energies if passing a file-handle?

Sorry, I didn't respond to your bug before @ahkole, now I did. I should clarify this in the sile.

And should one be able to backtrack a file-handle?

I think the bare reading and writing functions should not be able to backtrack a handle unless it is explicitly specified by the user in cases where it can make sense.

I agree.

Again, I am not per see against this, if it could aid in the use of files, then by all means. I must admit I just don't see a big improvement considering the efforts required to re-implement them. And, I would be very happy with a proof of concept, so don't hold back... ;)

To come back to this, here is an updated class that also handles StringIO and can backtrack to the passed file-position.

from io import StringIO
from pathlib import Path
from sisl.io.xyz import xyzSile

fh = open("CAP.xyz", "r")
string = StringIO(fh.read())
fh.seek(0)

class handle_or_StringIO_xyz(xyzSile):

    def __init__(self, filehandle, *args, **kwargs):
        try:
            filename = Path(filehandle.name)
        except:
            # this is not optimal, it will be the current directory, but one should not be able
            # to write to it
            filename = Path()
        try:
            mode = filehandle.mode
        except:
            # a StringIO will always be able to read *and* write
            # to its buffer
            mode = 'rw'
        self.fh = filehandle
        self._fh_init_tell = filehandle.tell()
        super().__init__(filename, mode, *args, **kwargs)

    def _open(self):
        self.fh.seek(self._fh_init_tell)
        self._line = 0

    def __exit__(self, type, value, traceback):
        # we will not close a file-handle
        self._line = 0
        return False

fh_geom = handle_or_StringIO_xyz(fh).read_geometry()
string_geom = handle_or_StringIO_xyz(string).read_geometry()
geom = xyzSile("CAP.xyz").read_geometry()
print(geom.equal(fh_geom))
print(geom.equal(string_geom))

@pfebrer
Copy link
Contributor

pfebrer commented Sep 29, 2022

Yes perhaps it's more practical to put it inside the Sile class.

However I think that this will make it more difficult for external libraries to rely on sisl for parsing since they have to trust the class inner workings, which can be obscure and create overheads that will not be easy to avoid. At least not in a simple, elegant way. I don't know if this is something to worry about.

@zerothi
Copy link
Owner

zerothi commented Sep 29, 2022

However I think that this will make it more difficult for external libraries to rely on sisl for parsing since they have to trust the class inner workings, which can be obscure and create overheads that will not be easy to avoid. At least not in a simple, elegant way. I don't know if this is something to worry about.

Agreed that our solution should be stable.

@zerothi
Copy link
Owner

zerothi commented Sep 30, 2022

Could you please have a look at the branch 484-filehandle for a first attempt... More tests should be added before final merge, probably...

@zerothi
Copy link
Owner

zerothi commented Oct 3, 2022

@ahkole could you please have a look and see if the current branch level would fix the issues for you? It should be able to do what you requested without copying buffers etc. (I hope ;))

@zerothi
Copy link
Owner

zerothi commented Oct 4, 2022

@ahkole now it should be in its final stage. Once you approve and have tested with aiida, let me know and I will merge it into main!

@ahkole
Copy link
Contributor Author

ahkole commented Oct 4, 2022

@zerothi I have tested the newest version with aiida and it seems to work very well for all the text-based output files retrieved by aiida! There is still the bug though that if you call a reading routine that does the reading by iterating over the sile object (i.e. read_energy of the outSileSiesta) then when this call exits then the filehandle that you supplied when creating the sile is closed, i.e. the following code snipped printed True:

with node.outputs.retrieved.open('aiida.out', 'r') as fh:
    sile = sisl.io.siesta.outSileSiesta(fh)
    test_data = sile.read_energy()
    print(fh.closed)

This is not what we want, right? I think the culprit here is the definition of the iterator object for the Sile class:

def __iter__(self):
    r""" Iterator for file """
    yield from self.fh

If instead we don't yield from the filehandle, but read the lines ourselves and yield those then the bug of the filehandle being closed goes away, i.e. with something like this:

def __iter__(self):
    r""" Iterator for file """
    l = self.readline()
    while l:
        yield l
        l = self.readline()

If I change this in the code then fh.closed returns False after calling read_energy. An added benefit of this approach might be that self._line also gets updated accordingly if we use the sile as an iterator. The only thing that I'm not sure about is if we should call self.readline with comment=True or not. What do you think?

@ahkole
Copy link
Contributor Author

ahkole commented Oct 4, 2022

@zerothi I also had a few additional feature requests related to this.

  1. Would it be possible to also make get_sile work with filehandles by using the name attribute of a filehandle to determine the corresponding sile class (provided the stream is actually a filehandle and not a different kind of text stream)? Maybe also optionally with the possibility of specifying a filename (or extension) yourself if it cannot be retrieved from the stream.
  2. When playing with this I also discovered that there is a way to open a NetCDF4 object from a Python buffer object (see the memory parameter of Dataset here https://unidata.github.io/netcdf4-python/#Dataset ). Would it be possible to also create a BufferSile for NetCDF4 siles by using this feature? That way we would also be able to create NetCDF4 sile from filehandles (and other byte streams).

@zerothi
Copy link
Owner

zerothi commented Oct 4, 2022

@zerothi I have tested the newest version with aiida and it seems to work very well for all the text-based output files retrieved by aiida! There is still the bug though that if you call a reading routine that does the reading by iterating over the sile object (i.e. read_energy of the outSileSiesta) then when this call exits then the filehandle that you supplied when creating the sile is closed, i.e. the following code snipped printed True:

with node.outputs.retrieved.open('aiida.out', 'r') as fh:
    sile = sisl.io.siesta.outSileSiesta(fh)
    test_data = sile.read_energy()
    print(fh.closed)

This is not what we want, right? I think the culprit here is the definition of the iterator object for the Sile class:

def __iter__(self):
    r""" Iterator for file """
    yield from self.fh

If instead we don't yield from the filehandle, but read the lines ourselves and yield those then the bug of the filehandle being closed goes away, i.e. with something like this:

def __iter__(self):
    r""" Iterator for file """
    l = self.readline()
    while l:
        yield l
        l = self.readline()

If I change this in the code then fh.closed returns False after calling read_energy. An added benefit of this approach might be that self._line also gets updated accordingly if we use the sile as an iterator. The only thing that I'm not sure about is if we should call self.readline with comment=True or not. What do you think?

You are probably right that we don't want ever to close the buffer. I'll amend.

@zerothi
Copy link
Owner

zerothi commented Oct 4, 2022

@zerothi I also had a few additional feature requests related to this.

1. Would it be possible to also make `get_sile` work with filehandles by using the `name` attribute of a filehandle to determine the corresponding sile class (provided the stream is actually a filehandle and not a different kind of text stream)? Maybe also optionally with the possibility of specifying a filename (or extension) yourself if it cannot be retrieved from the stream.

2. When playing with this I also discovered that there is a way to open a NetCDF4 object from a Python buffer object (see the `memory` parameter of `Dataset` here https://unidata.github.io/netcdf4-python/#Dataset ). Would it be possible to also create a `BufferSile` for NetCDF4 siles by using this feature? That way we would also be able to create NetCDF4 sile from filehandles (and other byte streams).

Please open a new issue with specific details. :)

@zerothi
Copy link
Owner

zerothi commented Oct 4, 2022

@zerothi I also had a few additional feature requests related to this.

1. Would it be possible to also make `get_sile` work with filehandles by using the `name` attribute of a filehandle to determine the corresponding sile class (provided the stream is actually a filehandle and not a different kind of text stream)? Maybe also optionally with the possibility of specifying a filename (or extension) yourself if it cannot be retrieved from the stream.

Using the buffer.name attribute would not work for StringIO instances. My initial thoughts is that the .name should not be used. If anything one should be able to pass a buffername as an argument to overwrite the filename on get_sile.
To add to this, I don't think we should do it like this. A better approach would be:

xyz_sile_class = si.get_sile_class("xyz")
xyz = xyz_sile_class(filebuffer)
# or in one
xyz = si.get_sile_class("xyz")(filebuffer)

this would be my recommended way to do it.

@ahkole
Copy link
Contributor Author

ahkole commented Oct 4, 2022

@zerothi I also had a few additional feature requests related to this.

1. Would it be possible to also make `get_sile` work with filehandles by using the `name` attribute of a filehandle to determine the corresponding sile class (provided the stream is actually a filehandle and not a different kind of text stream)? Maybe also optionally with the possibility of specifying a filename (or extension) yourself if it cannot be retrieved from the stream.

Using the buffer.name attribute would not work for StringIO instances. My initial thoughts is that the .name should not be used. If anything one should be able to pass a buffername as an argument to overwrite the filename on get_sile. To add to this, I don't think we should do it like this. A better approach would be:

xyz_sile_class = si.get_sile_class("xyz")
xyz = xyz_sile_class(filebuffer)
# or in one
xyz = si.get_sile_class("xyz")(filebuffer)

this would be my recommended way to do it.

Ah, yes, that is actually a much better way and works out of the box :)
No reason to change anything about get_sile then. If one is determined to use the name attribute one could always do get_sile_class(filehandle.name) themselves.

@ahkole
Copy link
Contributor Author

ahkole commented Oct 4, 2022


Please open a new issue with specific details. :)

I'll try to play around with these python buffers and see if I can open a NetCDF4 Dataset like that and if it works I'll create a new issue with the details on the feature request :)

@pfebrer
Copy link
Contributor

pfebrer commented Oct 5, 2022

Perhaps you should be able to pass the format string to the cls argument of get_sile?

I.e. : get_sile(file_handle, cls="xyz") instead of get_sile_class("xyz")(file_handle)

I'm saying this just because two consecutive calls usually makes code seem more complex.

@zerothi
Copy link
Owner

zerothi commented Oct 5, 2022

Perhaps you should be able to pass the format string to the cls argument of get_sile?

I.e. : get_sile(file_handle, cls="xyz") instead of get_sile_class("xyz")(file_handle)

I'm saying this just because two consecutive calls usually makes code seem more complex.

This won't work because sisl tries to figure out file extension before diverting to class arguments...
get_sile_class is the way to go.

@pfebrer
Copy link
Contributor

pfebrer commented Oct 5, 2022

But isn't it as simple as modifying get_sile to pass cls to get_sile_cls as the filename in case cls is a string? Basically doing the same thing that you suggest, but inside sisl's code.

@zerothi
Copy link
Owner

zerothi commented Oct 5, 2022

But isn't it as simple as modifying get_sile to pass cls to get_sile_cls as the filename in case cls is a string? Basically doing the same thing that you suggest, but inside sisl's code.

A PR would be welcome... However, the problem is that get_sile* relies on users being able to pass SileSiesta for instance, which limits the classes returned to be of that sub-class. So the basic machinery allows one to not be specific about the file name and then one parses the filename anyways.
It requires some major refactoring (I assume) in the get_sile_class method...

@pfebrer
Copy link
Contributor

pfebrer commented Oct 6, 2022

Hmm I was thinking on just changing the three lines of get_sile:

sisl/sisl/io/sile.py

Lines 312 to 314 in 7b043ed

cls = kwargs.pop('cls', None)
sile = get_sile_class(file, *args, cls=cls, **kwargs)
return sile(Path(str_spec(str(file))[0]), *args, **kwargs)

to

    cls = kwargs.pop('cls', None)
    if isinstance(cls, string):
        sile = get_sile_class(cls, *args, **kwargs) 
    else:
        sile = get_sile_class(file, *args, cls=cls, **kwargs) 
    return sile(Path(str_spec(str(file))[0]), *args, **kwargs) 

@zerothi
Copy link
Owner

zerothi commented Oct 6, 2022

I'd rather not. I don't think it becomes particularly clean... :(

Also, in the future I had in mind that cls could be a string that would retrieve Siles and its sub-classed equivalents instead of having users to pass explicit classes. I must admit, having a call on a method return value is not too complex... ;)

@pfebrer
Copy link
Contributor

pfebrer commented Jan 24, 2023

What was the problem for using binary streams instead of text streams (i.e. BytesIO instead of StringIO)? I can't recall.

In particular we are encountering a situation where we would like to parse the content of siesta basis files into Atom objects. We can do it for xml files now, but not for NetCDF basis files.

@zerothi
Copy link
Owner

zerothi commented Jan 25, 2023

I can't recall, I guess it should be done similarly to the way the BufferSile now works. So yes, we should do something there.

However, first we should be sure that the netcdf4 package actually provides this feature, see here:
Unidata/netcdf4-python#406
Unidata/netcdf4-python#550

It should be there, Unidata/netcdf4-python#652

@zerothi
Copy link
Owner

zerothi commented Jan 26, 2023

I see that @ahkole was going to try out a buffer with NetCDF, #484 (comment)

any progress there?

@ahkole
Copy link
Contributor Author

ahkole commented Jan 26, 2023

I see that @ahkole was going to try out a buffer with NetCDF, #484 (comment)

any progress there?

I looked into it a little bit. You can read netCDF data from a BytesIO using the memory parameter of the netCDF Dataset constructor (see https://unidata.github.io/netcdf4-python/#Dataset), the cleanest way I found so far to do it is,

data = BytesIO(some data)
view = data.getbuffer()  # Cannot directly pass a BytesIO object to Dataset but you can pass a memory view, so this way you don't have to copy the contents of the BytesIO object
dataset = Dataset('arbitrary name', mode='r', memory=view)

# Do something with dataset

view.release()  # To release the buffer of the BytesIO object

I have not tried using this to create a binary equivalent of the BufferSile but it should be possible. If you have a file handle instead of a BytesIO, then either you have to read the entire contents of the file to a bytes object and pass that to Dataset (which is suboptimal of course), or you could perhaps extract the name of the file from the filehandle and then pass that to the Dataset constructor. Maybe there's a nicer way to handle the case where you supply a filehandle object but I haven't found it yet.

This only works for netCDF binary files by the way. If you want to handle any other binary files using some sort of BufferSile you have to either figure out a way to read from a Python buffer using Fortran (maybe this is possible using C-interoperability because you can read from Python buffers using C) or you would have to re-implement the reading routines for the binary files directly in Python.

@zerothi
Copy link
Owner

zerothi commented Jan 27, 2023

Great, so it sounds like this could be passed in the same mechanism as we did for the regular siles.

Perhaps we should subclass BufferSile into TextBufferSile and BytesBufferSile and for NetCDF files use the BytesBufferSile. In this way users would be able to use BufferSile for both text and bytes stuff in NetCDF? any takers?

@pfebrer
Copy link
Contributor

pfebrer commented Jan 27, 2023

I leave it to @ahkole if he wants to do it :)

@ahkole
Copy link
Contributor Author

ahkole commented Jan 27, 2023

I could have a look at it next week, sure. It hopefully shouldn't be too difficult to include support for NetCDF files.

@ahkole
Copy link
Contributor Author

ahkole commented Mar 30, 2023

@zerothi @pfebrer Sorry for my late reply, I was so busy with other projects that I have not been able to look at this yet. I can still have a look at it but it might take some time before I can do this because of other responsibilities. How urgent is this support for netCDF files?

@pfebrer
Copy link
Contributor

pfebrer commented Mar 30, 2023

It's not urgent :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants