Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MemoryReader doesn't have a filename attribute #1027

Closed
6 tasks
kain88-de opened this issue Oct 14, 2016 · 24 comments
Closed
6 tasks

MemoryReader doesn't have a filename attribute #1027

kain88-de opened this issue Oct 14, 2016 · 24 comments

Comments

@kain88-de
Copy link
Member

kain88-de commented Oct 14, 2016

Problem

When I need a real copy of an AtomGroup i usually do

u2 = mda.Universe(ag.universe.filename, ag.universe.trajectory.filename)

This create a independent copy of the original universe. This is good to create references from a single Atomgroup argument in analysis functions. This doesn't work any more then the AtomGroup is a MemoryReader.

My main issue is that I can't create independent copies of atomgroups like I used to when I use the MemoryReader. There are two solutions for me. I can either create a completely independent copy another way or the MemoryReader still keeps a reference to the original file used to create it (but I admit that might be confusing).

Desired Implementation

  • add a filename attribute to the MemoryReader
    • set filename = None if constructed from an array or another anonymous source
    • set filename = <trajectory-filename> if it comes from a universe with an associated trajectory filename
  • in Universe.transfer_to_memory, set the filename attribute
  • test that the filename attribute of a MemoryReader instance built from an array is None
  • test that the filename attribute of a MemoryReader instance built via transfer_to_memory is the same as the source filename

History

@richardjgowers
Copy link
Member

Really we should have a Universe.clone or Universe.copy method that creates
this.

On Fri, 14 Oct 2016, 9:53 a.m. Max Linke, [email protected] wrote:

When I need a real copy of an AtomGroup i usually do

u2 = mda.Universe(ag.universe.filename, ag.universe.trajectory.filename)

This create a independent copy of the original universe. This is good to
create references from a single Atomgroup argument in analysis functions.
This doesn't work any more then the AtomGroup is a MemoryReader.

My main issue is that I can't create independent copies of atomgroups like
I used to when I use the MemoryReader. There are two solutions for me. I
can either create a completely independent copy another way or the
MemoryReader still keeps a reference to the original file used to create it
(but I admit that might be confusing).


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#1027, or mute the thread
https://github.com/notifications/unsubscribe-auth/AI0jBxjimHNzF6iv42UZqJbbvvakLismks5qz0MggaJpZM4KWyBL
.

@kain88-de
Copy link
Member Author

But the filename might still be interesting for the MemoryReader to know from where the data originally comes.

@kain88-de
Copy link
Member Author

I opened #1029 to discuss the copying of a universe. I'm would like some more feedback is the MemoryReader should hold some information about the original file used to load the data. @wouterboomsma and @mtiberti do you have an opinion on that?

@richardjgowers
Copy link
Member

If I remember right, you can init a memoryreader with just a numpy array, so that's a nice corner case for someone to figure out too :)

@wouterboomsma
Copy link
Contributor

@kain88-de, It would be a bit odd to have a filename in the MemoryReader. At least in our use cases, we frequently construct MemoryReaders directly from a numpy array, or as a merge from different trajectories. For instance, we often use something like this to merge universes:

def merge_universes(universes):
    """
    Merge list of universes into one

    Parameters
    ----------
    `universes` : list of Universe objects


    Returns
    ----------
    Universe object
    """

    for universe in universes:
        universe.transfer_to_memory()

    return mda.Universe(
        universes[0].filename,
        np.concatenate(tuple([e.trajectory.timeseries() for e in universes]),
        axis=1),
        format=MemoryReader)

As @richardjgowers suggests, a Universe.clone or Universe.copy could be a path forward although this function would require an explicit isinstance test for MemoryReader just as we already have in Universe.transfer_to_memory - which I can understand one might like to avoid. Another path would be to give readers a copy() or clone() method - but this would require that we added support for a copy-constructor like functionality in Universe, rather than the using the filenames directly:

u2 = mda.Universe(ag.universe.topology, ag.universe.trajectory.copy())

...which could of course also be hidden inside a Universe.copy(), bringing us back to @richardjgowers suggestion :)

u2 = ag.universe.copy(copy_topology=False, copy_trajectory=True)

@richardjgowers
Copy link
Member

It might be handy if transfer_to_memory did store what the original filename was, and it doesn't cost anything to do it. But .filename isn't always going to be available.

@kain88-de
Copy link
Member Author

What if we give a name when the MemoryReader is constructed from a universe with known filename. If the MemoryReader is constructed from a array we set the filename to None. What something like that work?

@richardjgowers
Copy link
Member

Yes, or something like 'numpy'

On Mon, 17 Oct 2016, 10:01 p.m. Max Linke, [email protected] wrote:

What if we give a name when the MemoryReader is constructed from a
universe with known filename. If the MemoryReader is constructed from a
array we set the filename to None. What something like that work?


You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub
#1027 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AI0jB_cXEUe_aRuPFaF-o-6Br0rxrpt3ks5q0-IlgaJpZM4KWyBL
.

@kain88-de
Copy link
Member Author

I would prefer a more general arraybut my mind tells me that a None is better since we can catch it in library code and we throw correct type errors for users.

@orbeckst
Copy link
Member

Can we just decide to add filename = None to the MemoryReader, document it, and then let application code deal with it?

(For context: The NamedStream class main reason for existence is to add an actual filename so that code can decide on the format of the stream. This is not needed here because the Reader API is known.)

@wouterboomsma
Copy link
Contributor

Sounds like a good compromise to have the filename attribute set when it makes sense.

@orbeckst
Copy link
Member

orbeckst commented Oct 26, 2016

I haven't heard anything else so for right now so I conclude that the consensus is to

  1. add a filename attribute to the MemoryReader
  2. set filename = None if constructed from an array or another anonymous source
  3. set filename = <trajectory-filename> if it comes from a universe with an associated trajectory filename

(Note on 3: The ChainReader will only give you the current trajectory filename without indication that there are other files in the chain.)

EDIT: properly paraphrased #1027 (comment)

@utkbansal
Copy link
Member

utkbansal commented Jan 28, 2017

I'd like to work on this.

@utkbansal
Copy link
Member

utkbansal commented Jan 29, 2017

@richardjgowers From what I understand, a MemoryReader is used when transfer_to_memory is called on the Universe object , or when in_memory=True is given while instantiating the Universe object.

What I don't understand is

u2 = mda.Universe(ag.universe.filename, ag.universe.trajectory.filename)
This create a independent copy of the original universe.

What I have seen in tests is something like this -
Universe(PDB_small, DCD, in_memory=True)

Is ag.universe.filename the PDB_small and ag.universe.trajectory.filename the DCD files? And the filename attribute in the MemoryReader will have the path to the DCD file?

@jbarnoud
Copy link
Contributor

jbarnoud commented Mar 19, 2017

Edit: Moved TODO list into Issue main body - RG

@utkbansal Are you still interested in tackling this issue?

@richardjgowers
Copy link
Member

I've moved the Issue goals into the main body of this issue so it's all in one (easy to find) place.

Is it safe to have MemoryReader.filename return two different types? (None or str).

@jbarnoud
Copy link
Contributor

@richardjgowers It might break in a few places. But these places probably break already without us knowing. Anyway, it is better to have the API clearly defined so the rest of the code can adapt. Also, an attribute being None when not relevant looks rather common in python.

@richardjgowers
Copy link
Member

richardjgowers commented Mar 19, 2017 via email

@utkbansal
Copy link
Member

@jbarnoud Sure, I'm happy to work on this.

@kain88-de
Copy link
Member Author

Is this also the correct solution if we change the underlying numpy arrays? I commonly write code to clone a universe like this u = mda.Universe(other.filename, other.trajectory.filename). I'm ok with this throwing an error due to a None value. But I'm concerned about the silent bug I get when this loads the original trajectory and not makes a copy of the memory reader. Maybe this is a sign we need an real copy function for universes.

@orbeckst
Copy link
Member

orbeckst commented Mar 20, 2017

Regarding #1027 (comment) : should we consider URI-style names: e.g

  • "file://trajectory.xtc"
  • "memory://trajectory.xtc"
  • None if not associate with any file-like object

(And if we ever get streams working: "https://pdb.org/.../1ake.mmtf" or similar.)

EDIT: This does not really addess @kain88-de 's concern, though. Rather, when changing the underlying array of MemoryReader, the filename attribute should be set to None to indicate that there exists no actual file-like object that contains the data.

@jbarnoud
Copy link
Contributor

@orbeckst I am a bit worried adding a protocol prefix would make the use of the attribute overly complicated. Indeed, it means adding a parsing step. Also, we would have to have the protocol bit in the filename for every reader.

@orbeckst
Copy link
Member

Yes, it would make it more complicated.

Also, we would have to have the protocol bit in the filename for every reader.

This is really not solving any immediate problem, so yes, ignore ;-). But I think it's worth thinking about eventually.

@mnmelo
Copy link
Member

mnmelo commented Mar 21, 2017

@orbeckst, rather than overloading the filename string with extra info we could add a kwarg or an attribute to the readers. Reader.source, for instance, defaulting to "file".

jbarnoud pushed a commit that referenced this issue Mar 22, 2017
Fixes #1027

Changes made in this Pull Request:

Adds filename attribute to MemoryReader
Adds tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants