alternative (easier?) way to define datasets #11

Open · wants to merge 6 commits into master

Changes from 1 commit
2 changes: 1 addition & 1 deletion MDAnalysisData/__init__.py
@@ -8,7 +8,7 @@
__all__ = ['datasets']

from . import datasets

from .base import fetch, DATASET_NAMES



100 changes: 22 additions & 78 deletions MDAnalysisData/adk_equilibrium.py
@@ -5,88 +5,32 @@
https://figshare.com/articles/Molecular_dynamics_trajectory_for_benchmarking_MDAnalysis/5108170/1
"""

from os.path import dirname, exists, join
from os import makedirs, remove
import codecs

import logging

from .base import get_data_home
from .base import _fetch_remote
from .base import RemoteFileMetadata
from .base import Bunch
from .base import RemoteFileMetadata, Dataset

NAME = "adk_equilibrium"
DESCRIPTION = "adk_equilibrium.rst"
# The original data can be found at the figshare URL.
# The SHA256 checksum of the zip file changes with every download so we
# cannot check its checksum. Instead we download the individual files
# separately. The keys of this dict are also going to be the keys in the
# Bunch that is returned.
ARCHIVE = {
    'topology': RemoteFileMetadata(
        filename='adk4AKE.psf',
        url='https://ndownloader.figshare.com/files/8672230',
        checksum='1aa947d58fb41b6805dc1e7be4dbe65c6a8f4690f0bd7fc2ae03e7bd437085f4',
    ),
    'trajectory': RemoteFileMetadata(
        filename='1ake_007-nowater-core-dt240ps.dcd',
        url='https://ndownloader.figshare.com/files/8672074',
        checksum='598fcbcfcc425f6eafbe9997238320fcacc6a4613ecce061e1521732bab734bf',
    ),
}

logger = logging.getLogger(__name__)


def fetch_adk_equilibrium(data_home=None, download_if_missing=True):
"""Load the AdK 1us equilibrium trajectory (without water)

Parameters
----------
data_home : optional, default: None
Specify another download and cache folder for the datasets. By default
all MDAnalysisData data is stored in '~/MDAnalysis_data' subfolders.
This dataset is stored in ``<data_home>/adk_equilibrium``.
download_if_missing : optional, default=True
If ``False``, raise a :exc:`IOError` if the data is not locally available
instead of trying to download the data from the source site.

Returns
-------
dataset : dict-like object with the following attributes:
Member Author: and another thing lost is the clear description of what the returned Bunch will have....

Member: I don't want to lose this...

    dataset.topology : filename
        Filename of the topology file
    dataset.trajectory : filename
        Filename of the trajectory file
    dataset.DESCR : string
        Description of the trajectory.


    See :ref:`adk-equilibrium-dataset` for description.
    """
    name = NAME
    data_location = join(get_data_home(data_home=data_home),
                         name)
    if not exists(data_location):
        makedirs(data_location)

    records = Bunch()
    for file_type, meta in ARCHIVE.items():
        local_path = join(data_location, meta.filename)
        records[file_type] = local_path

        if not exists(local_path):
            if not download_if_missing:
                raise IOError("Data {0}={1} not found and `download_if_missing` is "
                              "False".format(file_type, local_path))
            logger.info("Downloading {0}: {1} -> {2}...".format(
                file_type, meta.url, local_path))
            archive_path = _fetch_remote(meta, dirname=data_location)

    module_path = dirname(__file__)
    with codecs.open(join(module_path, 'descr', DESCRIPTION),
                     encoding="utf-8") as dfile:
        records.DESCR = dfile.read()

    return records
class ADK_Equilibrium(Dataset):
NAME = "adk_equilibrium"
DESCRIPTION = "adk_equilibrium.rst"
Member Author: One downside I can see is we've lost the short description that the function to generate this had

Member: Don't want to lose the description and don't want to lose the docs...

Member Author: So currently it looks like:

>>> print(ds.__doc__)
AdK 1us equilibrium trajectory (without water)

    Attributes
    ----------
    topology : filename
         Filename of the topology file
    trajectory : filename
         Filename of the trajectory file
    DESCR : string
         Description of the trajectory.

So mostly there still


    # The original data can be found at the figshare URL.
    # The SHA256 checksum of the zip file changes with every download so we
    # cannot check its checksum. Instead we download the individual files
    # separately. The keys of this dict are also going to be the keys in the
    # Bunch that is returned.
    ARCHIVE = {
Member: If we normalize all of this then we might be able to just put all these data into JSON or YAML files.

Member Author: Sure, this is essentially py-son at this point

        'topology': RemoteFileMetadata(
            filename='adk4AKE.psf',
            url='https://ndownloader.figshare.com/files/8672230',
            checksum='1aa947d58fb41b6805dc1e7be4dbe65c6a8f4690f0bd7fc2ae03e7bd437085f4',
        ),
        'trajectory': RemoteFileMetadata(
            filename='1ake_007-nowater-core-dt240ps.dcd',
            url='https://ndownloader.figshare.com/files/8672074',
            checksum='598fcbcfcc425f6eafbe9997238320fcacc6a4613ecce061e1521732bab734bf',
        ),
    }
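Not part of the diff: a minimal sketch of the JSON idea floated in the review comments above, assuming a hypothetical per-dataset adk_equilibrium.json file that mirrors ARCHIVE; neither that file nor the _load_archive helper exists in this PR.

import json
from os.path import dirname, join

from .base import RemoteFileMetadata

def _load_archive(name):
    # Read one dataset's remote-file metadata from a JSON file and rebuild
    # the same {key: RemoteFileMetadata} mapping that ARCHIVE holds today.
    with open(join(dirname(__file__), 'descr', name + '.json')) as f:
        raw = json.load(f)
    return {key: RemoteFileMetadata(**fields) for key, fields in raw.items()}

# e.g. as a class attribute:
# ARCHIVE = _load_archive('adk_equilibrium')
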
57 changes: 56 additions & 1 deletion MDAnalysisData/base.py
@@ -31,9 +31,11 @@

import shutil
from collections import namedtuple
from os import environ, listdir, makedirs
from os import environ, listdir, makedirs, remove
from os.path import dirname, exists, expanduser, isdir, join, splitext
import hashlib
import codecs




@@ -94,6 +96,59 @@ def __setstate__(self, state):
RemoteFileMetadata = namedtuple('RemoteFileMetadata',
                                ['filename', 'url', 'checksum'])

DATASET_NAMES = {}

class _DatasetRegister(type):
    def __new__(meta, name, bases, class_dict):
        cls = type.__new__(meta, name, bases, class_dict)
        if not cls.NAME is None:
            DATASET_NAMES[cls.NAME] = cls
        return cls


class Dataset(Bunch, metaclass=_DatasetRegister):
    NAME = None
    DESCRIPTION = None
    ARCHIVE = None

    def __init__(self, data_home=None, download_if_missing=True):
        data_location = join(get_data_home(data_home=data_home),
                             self.NAME)

        if not exists(data_location):
            makedirs(data_location)

        contents = {}
        for file_type, meta in self.ARCHIVE.items():
            local_path = join(data_location, meta.filename)
            contents[file_type] = local_path

            if not exists(local_path):
                if not download_if_missing:
                    raise IOError("Data {0}={1} not found and `download_if_missing` is "
                                  "False".format(file_type, local_path))
                logger.info("Downloading {0}: {1} -> {2}...".format(
                    file_type, meta.url, local_path))
                archive_path = _fetch_remote(meta, dirname=data_location)

        module_path = dirname(__file__)
        with codecs.open(join(module_path, 'descr', self.DESCRIPTION),
                         encoding="utf-8") as dfile:
            contents['DESCR'] = dfile.read()


        # finally, init the Bunch object
        super().__init__(**contents)


def fetch(dataset, data_home=None, download_if_missing=True):
Member Author: This allows MDADATA.fetch('adk_equilibrium'), rather than a separate function for each dataset

Member: There's a reason why explicit functions: tab completion and introspection. (sklearn does it and it works really well – much better than having to know the name of the dataset)

I'd like to keep explicit functions – both for ease of use and for the same "look and feel" as sklearn.datasets (as well as getting docs!)

We can have a generic mechanism and generate the fetch_* functions.

Member: Yes, that's nice for this case. But did you look at some of the other accessors like fetch_adk_transitions_DIMS where we get a tar file and unpack? We might be able to reduce our requirements to these two types of cases.

Member Author: could put the compression/other info into the RemoteFileMetaData object

Member Author: Ok yeah the namespace is nice, we could implement the static functions as

def fetch_adk():
    return base.fetch('adk')

Member: RemoteFileMetadata is verbatim from sklearn. Might be useful to keep it that way and really keep it simple.

If anything, we should build data structures that contain RemoteFileMetadata instances that map remote and local. Have a look at the transitions dataset to see what else we have.

Finally, have a look at sklearn.datasets (and the outstanding docs) to see the variance. I think one reason for copy&paste code is that ultimately each dataset in the wild might have slightly different requirements. Still, that's not to say that we can't try to get a bit of order in ;-).

"""Grab a named dataset"""
try:
        return DATASET_NAMES[dataset](data_home=data_home,
                                      download_if_missing=download_if_missing)
    except KeyError:
        raise KeyError("unknown dataset: {}".format(dataset))


def get_data_home(data_home=None):
"""Return the path of the MDAnalysisData data dir.
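Outside the diff: a hedged sketch of how the explicit fetch_* wrappers discussed in the thread above could be generated from DATASET_NAMES while keeping tab completion and per-dataset docstrings; _make_fetcher is a hypothetical helper, not part of this PR.

def _make_fetcher(name, cls):
    # Build an explicit fetch_<name> function that defers to fetch() but
    # carries its own name and the dataset class's docstring for introspection.
    def fetcher(data_home=None, download_if_missing=True):
        return fetch(name, data_home=data_home,
                     download_if_missing=download_if_missing)
    fetcher.__name__ = 'fetch_{}'.format(name)
    fetcher.__doc__ = cls.__doc__
    return fetcher

# e.g. in datasets.py:
# for _name, _cls in DATASET_NAMES.items():
#     globals()['fetch_{}'.format(_name)] = _make_fetcher(_name, _cls)
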
3 changes: 1 addition & 2 deletions MDAnalysisData/datasets.py
@@ -7,7 +7,7 @@


from .base import get_data_home, clear_data_home
from .adk_equilibrium import fetch_adk_equilibrium
from . import adk_equilibrium
from .adk_transitions import (fetch_adk_transitions_DIMS,
                              fetch_adk_transitions_FRODA)
from .ifabp_water import fetch_ifabp_water
@@ -16,7 +16,6 @@
__all__ = [
    'get_data_home',
    'clear_data_home',
    'fetch_adk_equilibrium',
    'fetch_adk_transitions_DIMS',
    'fetch_adk_transitions_FRODA',
    'fetch_ifabp_water',
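For context, a hedged usage sketch of the API this PR moves toward; only fetch() and the ADK_Equilibrium class come from the diff above, and the MDAnalysis lines at the end are an assumed downstream use, not part of the PR.

import MDAnalysisData
from MDAnalysisData.adk_equilibrium import ADK_Equilibrium

# fetch by registered name (resolved through DATASET_NAMES)...
adk = MDAnalysisData.fetch('adk_equilibrium')

# ...or instantiate the Dataset subclass directly
adk = ADK_Equilibrium()

# the returned Bunch exposes the downloaded file paths and the description
print(adk.topology, adk.trajectory)
print(adk.DESCR[:80])

# typical downstream use (assumption, requires MDAnalysis):
# import MDAnalysis as mda
# u = mda.Universe(adk.topology, adk.trajectory)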