alternative (easier?) way to define datasets #11
base: master
Changes from 1 commit
bc28bdd
3b5dc82
7e1cbd3
665517b
369266a
bfd2c88
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,7 +8,7 @@ | |
__all__ = ['datasets'] | ||
|
||
from . import datasets | ||
|
||
from .base import fetch, DATASET_NAMES | ||
|
||
|
||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,88 +5,32 @@ | |
https://figshare.com/articles/Molecular_dynamics_trajectory_for_benchmarking_MDAnalysis/5108170/1 | ||
""" | ||
|
||
from os.path import dirname, exists, join | ||
from os import makedirs, remove | ||
import codecs | ||
|
||
import logging | ||
|
||
from .base import get_data_home | ||
from .base import _fetch_remote | ||
from .base import RemoteFileMetadata | ||
from .base import Bunch | ||
from .base import RemoteFileMetadata, Dataset | ||
|
||
NAME = "adk_equilibrium" | ||
DESCRIPTION = "adk_equilibrium.rst" | ||
# The original data can be found at the figshare URL. | ||
# The SHA256 checksum of the zip file changes with every download so we | ||
# cannot check its checksum. Instead we download individual files | ||
# separately. The keys of this dict are also going to be the keys in the | ||
# Bunch that is returned. | ||
ARCHIVE = { | ||
'topology': RemoteFileMetadata( | ||
filename='adk4AKE.psf', | ||
url='https://ndownloader.figshare.com/files/8672230', | ||
checksum='1aa947d58fb41b6805dc1e7be4dbe65c6a8f4690f0bd7fc2ae03e7bd437085f4', | ||
), | ||
'trajectory': RemoteFileMetadata( | ||
filename='1ake_007-nowater-core-dt240ps.dcd', | ||
url='https://ndownloader.figshare.com/files/8672074', | ||
checksum='598fcbcfcc425f6eafbe9997238320fcacc6a4613ecce061e1521732bab734bf', | ||
), | ||
} | ||
|
||
logger = logging.getLogger(__name__) | ||
|
||
|
||
def fetch_adk_equilibrium(data_home=None, download_if_missing=True): | ||
"""Load the AdK 1us equilibrium trajectory (without water) | ||
|
||
Parameters | ||
---------- | ||
data_home : optional, default: None | ||
Specify another download and cache folder for the datasets. By default | ||
all MDAnalysisData data is stored in '~/MDAnalysis_data' subfolders. | ||
This dataset is stored in ``<data_home>/adk_equilibrium``. | ||
download_if_missing : optional, default=True | ||
If ``False``, raise an :exc:`IOError` if the data is not locally available | ||
instead of trying to download the data from the source site. | ||
|
||
Returns | ||
------- | ||
dataset : dict-like object with the following attributes: | ||
dataset.topology : filename | ||
Filename of the topology file | ||
dataset.trajectory : filename | ||
Filename of the trajectory file | ||
dataset.DESCR : string | ||
Description of the trajectory. | ||
|
||
|
||
See :ref:`adk-equilibrium-dataset` for description. | ||
""" | ||
name = NAME | ||
data_location = join(get_data_home(data_home=data_home), | ||
name) | ||
if not exists(data_location): | ||
makedirs(data_location) | ||
|
||
records = Bunch() | ||
for file_type, meta in ARCHIVE.items(): | ||
local_path = join(data_location, meta.filename) | ||
records[file_type] = local_path | ||
|
||
if not exists(local_path): | ||
if not download_if_missing: | ||
raise IOError("Data {0}={1} not found and `download_if_missing` is " | ||
"False".format(file_type, local_path)) | ||
logger.info("Downloading {0}: {1} -> {2}...".format( | ||
file_type, meta.url, local_path)) | ||
archive_path = _fetch_remote(meta, dirname=data_location) | ||
|
||
module_path = dirname(__file__) | ||
with codecs.open(join(module_path, 'descr', DESCRIPTION), | ||
encoding="utf-8") as dfile: | ||
records.DESCR = dfile.read() | ||
|
||
return records | ||
class ADK_Equilibrium(Dataset): | ||
NAME = "adk_equilibrium" | ||
DESCRIPTION = "adk_equilibrium.rst" | ||
One downside I can see is that we've lost the short description that the function generating this used to have.
Don't want to lose the description and don't want to lose the docs...
So currently it looks like:
So it's mostly still there |
||
|
||
# The original data can be found at the figshare URL. | ||
# The SHA256 checksum of the zip file changes with every download so we | ||
# cannot check its checksum. Instead we download individual files | ||
# separately. The keys of this dict are also going to be the keys in the | ||
# Bunch that is returned. | ||
ARCHIVE = { | ||
If we normalize all of this then we might be able to just put all these data into JSON or YAML files.
Sure, this is essentially py-son at this point |
||
'topology': RemoteFileMetadata( | ||
filename='adk4AKE.psf', | ||
url='https://ndownloader.figshare.com/files/8672230', | ||
checksum='1aa947d58fb41b6805dc1e7be4dbe65c6a8f4690f0bd7fc2ae03e7bd437085f4', | ||
), | ||
'trajectory': RemoteFileMetadata( | ||
filename='1ake_007-nowater-core-dt240ps.dcd', | ||
url='https://ndownloader.figshare.com/files/8672074', | ||
checksum='598fcbcfcc425f6eafbe9997238320fcacc6a4613ecce061e1521732bab734bf', | ||
), | ||
} |
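The JSON/YAML idea raised in the review could be sketched roughly as follows. The JSON layout and the `load_dataset_spec` helper are assumptions made for illustration; the PR does not define a file format. Only the file names, URLs and checksums are taken from the diff.

```python
import json
from collections import namedtuple

RemoteFileMetadata = namedtuple('RemoteFileMetadata',
                                ['filename', 'url', 'checksum'])

# Hypothetical JSON layout (not part of the PR); values copied from ARCHIVE.
ADK_JSON = """
{
  "name": "adk_equilibrium",
  "description": "adk_equilibrium.rst",
  "archive": {
    "topology": {
      "filename": "adk4AKE.psf",
      "url": "https://ndownloader.figshare.com/files/8672230",
      "checksum": "1aa947d58fb41b6805dc1e7be4dbe65c6a8f4690f0bd7fc2ae03e7bd437085f4"
    },
    "trajectory": {
      "filename": "1ake_007-nowater-core-dt240ps.dcd",
      "url": "https://ndownloader.figshare.com/files/8672074",
      "checksum": "598fcbcfcc425f6eafbe9997238320fcacc6a4613ecce061e1521732bab734bf"
    }
  }
}
"""


def load_dataset_spec(text):
    """Parse a JSON dataset spec into (name, description, archive).

    `archive` maps each key to a RemoteFileMetadata, mirroring ARCHIVE.
    """
    spec = json.loads(text)
    archive = {key: RemoteFileMetadata(**fields)
               for key, fields in spec["archive"].items()}
    return spec["name"], spec["description"], archive


name, description, archive = load_dataset_spec(ADK_JSON)
```

A layout like this keeps each dataset declarative, but it has no natural place for a per-dataset docstring, which is the concern raised elsewhere in this review.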
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -31,9 +31,11 @@ | |
|
||
import shutil | ||
from collections import namedtuple | ||
from os import environ, listdir, makedirs | ||
from os import environ, listdir, makedirs, remove | ||
from os.path import dirname, exists, expanduser, isdir, join, splitext | ||
import hashlib | ||
import codecs | ||
|
||
|
||
|
||
|
||
|
@@ -94,6 +96,59 @@ def __setstate__(self, state): | |
RemoteFileMetadata = namedtuple('RemoteFileMetadata', | ||
['filename', 'url', 'checksum']) | ||
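For orientation, the `checksum` field is a SHA256 hex digest of the individual file. The body of `_fetch_remote` is not shown in this diff, so the following is only a minimal sketch of how such a check might look; `sha256_of` and `verify_checksum` are hypothetical helper names.

```python
import hashlib
from collections import namedtuple

RemoteFileMetadata = namedtuple('RemoteFileMetadata',
                                ['filename', 'url', 'checksum'])


def sha256_of(path, chunk_size=8192):
    """Return the SHA256 hex digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, meta):
    """Return `path` if its digest matches meta.checksum, else raise IOError."""
    actual = sha256_of(path)
    if actual != meta.checksum:
        raise IOError("{0} has SHA256 checksum {1}, expected {2}".format(
            path, actual, meta.checksum))
    return path
```

Reading in chunks avoids loading a large trajectory file into memory at once.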
|
||
DATASET_NAMES = {} | ||
|
||
class _DatasetRegister(type): | ||
def __new__(meta, name, bases, class_dict): | ||
cls = type.__new__(meta, name, bases, class_dict) | ||
if cls.NAME is not None: | ||
DATASET_NAMES[cls.NAME] = cls | ||
return cls | ||
|
||
|
||
class Dataset(Bunch, metaclass=_DatasetRegister): | ||
NAME = None | ||
DESCRIPTION = None | ||
ARCHIVE = None | ||
|
||
def __init__(self, data_home=None, download_if_missing=True): | ||
data_location = join(get_data_home(data_home=data_home), | ||
self.NAME) | ||
|
||
if not exists(data_location): | ||
makedirs(data_location) | ||
|
||
contents = {} | ||
for file_type, meta in self.ARCHIVE.items(): | ||
local_path = join(data_location, meta.filename) | ||
contents[file_type] = local_path | ||
|
||
if not exists(local_path): | ||
if not download_if_missing: | ||
raise IOError("Data {0}={1} not found and `download_if_missing` is " | ||
"False".format(file_type, local_path)) | ||
logger.info("Downloading {0}: {1} -> {2}...".format( | ||
file_type, meta.url, local_path)) | ||
archive_path = _fetch_remote(meta, dirname=data_location) | ||
|
||
module_path = dirname(__file__) | ||
with codecs.open(join(module_path, 'descr', self.DESCRIPTION), | ||
encoding="utf-8") as dfile: | ||
contents['DESCR'] = dfile.read() | ||
|
||
|
||
# finally, init the Bunch object | ||
super().__init__(**contents) | ||
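As a point of comparison, the same registration can be done without a metaclass by using `__init_subclass__` (Python 3.6+). This is an alternative technique, not what the PR implements; the sketch below uses a plain `dict` as a stand-in for `Bunch` and omits all downloading.

```python
DATASET_NAMES = {}


class Dataset(dict):
    """Stand-in for the PR's Dataset (dict replaces Bunch, no downloads)."""
    NAME = None

    def __init_subclass__(cls, **kwargs):
        # Called once for every subclass at class-creation time.
        super().__init_subclass__(**kwargs)
        if cls.NAME is not None:
            DATASET_NAMES[cls.NAME] = cls


class ADK_Equilibrium(Dataset):
    NAME = "adk_equilibrium"


class Unnamed(Dataset):
    """Subclasses that leave NAME as None are not registered."""
```

The base class itself is never registered, because `__init_subclass__` only runs for subclasses.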
|
||
|
||
def fetch(dataset, data_home=None, download_if_missing=True): | ||
This allows
There's a reason for explicit functions: tab completion and introspection. (sklearn does it and it works really well, much better than having to know the name of the dataset.) I'd like to keep explicit functions, both for ease of use and for the same "look and feel" as sklearn.datasets (as well as getting docs!). We can have a generic mechanism and generate the
Yes, that's nice for this case. But did you look at some of the other accessors like
We could put the compression/other info into the
OK, yeah, the namespace is nice; we could implement the static functions as
def fetch_adk():
    return base.fetch('adk')
If anything, we should build data structures that contain
Finally, have a look at sklearn.datasets (and the outstanding docs) to see the variance. I think one reason for copy&paste code is that ultimately each dataset in the wild might have slightly different requirements. Still, that's not to say that we can't try to get a bit of order in ;-). |
||
"""Grab a named dataset""" | ||
try: | ||
return DATASET_NAMES[dataset](data_home=data_home, | ||
download_if_missing=download_if_missing) | ||
except KeyError: | ||
raise KeyError("unknown dataset: {}".format(dataset)) | ||
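One way to reconcile the generic `fetch` with the explicit, tab-completable functions requested in the review is to generate thin wrappers from the registry. The sketch below is an illustration under assumptions: `make_fetcher` is a hypothetical helper, the dataset class is a stub that records its arguments instead of downloading, and the class docstring is reused as the wrapper's docstring.

```python
class ADK_Equilibrium(dict):
    """Load the AdK 1us equilibrium trajectory (without water)"""
    NAME = "adk_equilibrium"

    def __init__(self, data_home=None, download_if_missing=True):
        # Stub: record the arguments instead of fetching any files.
        super().__init__(data_home=data_home,
                         download_if_missing=download_if_missing)


DATASET_NAMES = {ADK_Equilibrium.NAME: ADK_Equilibrium}


def fetch(dataset, data_home=None, download_if_missing=True):
    """Grab a named dataset"""
    try:
        cls = DATASET_NAMES[dataset]
    except KeyError:
        raise KeyError("unknown dataset: {}".format(dataset))
    return cls(data_home=data_home, download_if_missing=download_if_missing)


def make_fetcher(name):
    """Build an explicit fetch_<name> function that delegates to fetch()."""
    cls = DATASET_NAMES[name]

    def fetcher(data_home=None, download_if_missing=True):
        return fetch(name, data_home=data_home,
                     download_if_missing=download_if_missing)

    fetcher.__name__ = "fetch_" + name
    fetcher.__doc__ = cls.__doc__  # keep the per-dataset documentation
    return fetcher


fetch_adk_equilibrium = make_fetcher("adk_equilibrium")
```

Assigning the generated function to a module-level name keeps it discoverable through tab completion, while the download logic stays in one place.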
|
||
|
||
def get_data_home(data_home=None): | ||
"""Return the path of the MDAnalysisData data dir. | ||
|
And another thing lost is the clear description of what the returned Bunch will have...
I don't want to lose this...
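One possible way to keep the description of the returned Bunch is to generate the "Returns" section from the keys of `ARCHIVE`. This is only a sketch of the idea; `describe_returns` is a hypothetical helper, and the wording it emits is copied from the docstring of the removed `fetch_adk_equilibrium` function.

```python
def describe_returns(archive_keys):
    """Build a numpydoc-style Returns section listing the Bunch attributes."""
    lines = [
        "Returns",
        "-------",
        "dataset : dict-like object with the following attributes:",
    ]
    for key in archive_keys:
        lines.append("    dataset.{0} : filename".format(key))
        lines.append("        Filename of the {0} file".format(key))
    lines.append("    dataset.DESCR : string")
    lines.append("        Description of the trajectory.")
    return "\n".join(lines)


returns_doc = describe_returns(["topology", "trajectory"])
```

The generated text could be appended to a dataset class's `__doc__`, so the attribute list cannot drift out of sync with `ARCHIVE`.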