Migrating indexes to Zenodo - Genís&Guillem #641

Merged
merged 45 commits into master from genis/remote_indexes_genis on Nov 6, 2024
Changes from all commits

45 commits
ca55805
remote index example 1
genisplaja Oct 29, 2024
6b3d32e
rest of assigned datasets
genisplaja Oct 29, 2024
002cc47
fixes and fix multitrack sample indexes
genisplaja Oct 30, 2024
ea4cd1e
formatting
genisplaja Oct 30, 2024
7f6dab7
fixing VersionTest tests
genisplaja Oct 31, 2024
4e3e3f9
Last pending indexes migrated to zenodo
guillemcortes Nov 1, 2024
79f9201
Merge branch 'master' into genis/remote_indexes_genis
guillemcortes Nov 1, 2024
3f76f52
black formatting
guillemcortes Nov 1, 2024
385bcfa
Merge branch 'master' into genis/remote_indexes_genis
genisplaja Nov 2, 2024
1d7960e
expanding slakh tests
genisplaja Nov 2, 2024
7a07dd7
Merge remote-tracking branch 'origin/genis/remote_indexes_genis' into…
genisplaja Nov 2, 2024
97a2e08
fixes in slakh and tests for remote indexes
genisplaja Nov 2, 2024
4e695d2
Merge branch 'master' into genis/remote_indexes_genis
tanmayy24 Nov 3, 2024
1276786
Merge branch 'master' into genis/remote_indexes_genis
guillemcortes Nov 4, 2024
4509db6
ADD Cuidado and Simac remote indexes
guillemcortes Nov 4, 2024
1ba7484
fixes in cipi
genisplaja Nov 4, 2024
de6af78
merging...
genisplaja Nov 4, 2024
d3aa80f
Move sample indexes to tests folder
guillemcortes Nov 4, 2024
b0084bf
Merge remote-tracking branch 'origin/genis/remote_indexes_genis' into…
guillemcortes Nov 4, 2024
8975b18
fix test_core test indexes path
genisplaja Nov 4, 2024
376d439
Merge remote-tracking branch 'origin/genis/remote_indexes_genis' into…
genisplaja Nov 4, 2024
ecec932
Improve error message
guillemcortes Nov 4, 2024
d583d21
Define index_dir for test indexes
guillemcortes Nov 4, 2024
10fc3af
rename sample index simac
guillemcortes Nov 4, 2024
cdab141
black formatting
guillemcortes Nov 4, 2024
0794ad9
fix simac test
guillemcortes Nov 4, 2024
fd64c3f
ignore json indexes
guillemcortes Nov 4, 2024
62c08b9
Update PR template
guillemcortes Nov 4, 2024
5f7fad2
Update contributing documentation
guillemcortes Nov 4, 2024
9fc81a6
Update docs
guillemcortes Nov 4, 2024
345c981
Update example
guillemcortes Nov 4, 2024
9eac832
Update docs
guillemcortes Nov 4, 2024
0063a97
Tutorial section name update
guillemcortes Nov 4, 2024
03359fe
soundatas-->mirdata, fix upload_index ref
genisplaja Nov 5, 2024
0c6ff40
Removal of crema from testing indexes
Nov 5, 2024
193166f
Update contributing
guillemcortes Nov 5, 2024
2ca4cb1
Merge remote-tracking branch 'origin/genis/remote_indexes_genis' into…
guillemcortes Nov 5, 2024
f0cb0e5
support for dagstuhl multitracks
Nov 5, 2024
080e0d4
Specify to fork the repo in contributing docs
guillemcortes Nov 5, 2024
6b721b5
Merge branch 'master' into genis/remote_indexes_genis
guillemcortes Nov 5, 2024
b042d43
Fix LICENSE link
guillemcortes Nov 5, 2024
ee7d4a1
missing .json in index links
genisplaja Nov 5, 2024
22ff86f
move mdb_stem_synth to remote
genisplaja Nov 5, 2024
57a797d
missing version in tests
genisplaja Nov 5, 2024
c6161dc
minor formatting fixes in FAQ docs
genisplaja Nov 6, 2024
3 changes: 2 additions & 1 deletion .github/PULL_REQUEST_TEMPLATE/new_loader.md
@@ -16,7 +16,8 @@ Please include the following information at the top level docstring for the data
#### Dataset loaders checklist:

- [ ] Create a script in `scripts/`, e.g. `make_my_dataset_index.py`, which generates an index file.
-- [ ] Run the script on the canonical version of the dataset and save the index in `mirdata/indexes/` e.g. `my_dataset_index.json`.
+- [ ] Run the script on the canonical version of the dataset and upload the index to [Zenodo Audio Data Loaders community](https://zenodo.org/communities/audio-data-loaders).
+- [ ] Create a sample version of the index with the necessary information for testing.
- [ ] Create a module in mirdata, e.g. `mirdata/my_dataset.py`
- [ ] Create tests for your loader in `tests/datasets/`, e.g. `test_my_dataset.py`
- [ ] Add your module to `docs/source/mirdata.rst` and `docs/source/table.rst`
1 change: 1 addition & 0 deletions .gitignore
@@ -2,6 +2,7 @@ tests/resources/mir_datasets_full
tests/data/output.wav
tests/resources/mir_datasets/haydn_op20/op20n1-01.midi
mirdata/datasets/indexes/__MACOSX
+mirdata/datasets/indexes/*.json
*.DS_Store

# Byte-compiled / optimized / DLL files
4 changes: 2 additions & 2 deletions docs/index.rst
@@ -7,7 +7,7 @@ mirdata



-``mirdata`` is an open-source Python library that provides tools for working with common Music Information Retrieval (MIR) datasets, including tools for:
+Mirdata is an open-source Python library that provides tools for working with common Music Information Retrieval (MIR) datasets, including tools for:

* downloading datasets to a common location and format
* validating that the files for a dataset are all present
@@ -41,7 +41,7 @@ If you refer to mirdata's design principles, motivation etc., please cite the fo
"mirdata: Software for Reproducible Usage of Datasets."
In Proceedings of the 20th International Society for Music Information Retrieval (ISMIR) Conference, 2019.:

-When working with datasets, please cite the version of ``mirdata`` that you are using (given by the ``DOI`` above)
+When working with datasets, please cite the version of Mirdata that you are using (given by the ``DOI`` above)
**AND** include the reference of the dataset, which can be found in the respective dataset loader using the ``cite()`` method.
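
For example, the dataset-specific citation can be printed from the loader (a minimal sketch; the ``beatles`` loader id is used here as an illustrative placeholder):

.. code-block:: python

    import mirdata

    dataset = mirdata.initialize("beatles")
    dataset.cite()  # prints the citation of the dataset itself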


181 changes: 130 additions & 51 deletions docs/source/contributing.rst
@@ -18,28 +18,20 @@ to, please tag your PR with ``please-do-not-edit``.
Installing mirdata for development purposes
###########################################

-To install ``mirdata`` for development purposes:
+To install Mirdata for development purposes:

-* First run:
-
-.. code-block:: console
-
-    git clone https://github.com/mir-dataset-loaders/mirdata.git
-
-* Then, after opening source data library you have to install the dependencies for updating the documentation
-  and running tests:
-
-.. code-block:: console
-
-    pip install .
-    pip install ."[tests]"
-    pip install ."[docs]"
-    pip install ."[dali]"
-    pip install ."[haydn_op20]"
+- First, fork the Mirdata repository on GitHub and clone your fork locally (see the console sketch after this list).
+
+- Then, from the root of the cloned repository, install all the dependencies:
+
+  - Install core dependencies with ``pip install .``
+  - Install testing dependencies with ``pip install ."[tests]"``
+  - Install docs dependencies with ``pip install ."[docs]"``
+  - Install dataset-specific dependencies with ``pip install ."[dataset]"``, where ``dataset`` can be ``dali | haydn_op20 | cipi ...``
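
A minimal console sketch of this setup, assuming a fork under the hypothetical GitHub username ``<your-username>``:

.. code-block:: console

    # clone your fork and enter the repository
    git clone https://github.com/<your-username>/mirdata.git
    cd mirdata

    # install core, testing, and docs dependencies
    pip install .
    pip install ."[tests]"
    pip install ."[docs]"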


We recommend installing `pyenv <https://github.com/pyenv/pyenv#installation>`_ to manage your Python versions
-and install all ``mirdata`` requirements. You will want to install the latest supported Python versions (see README.md).
+and install all Mirdata requirements. You will want to install the latest supported Python versions (see README.md).
Once ``pyenv`` and the Python versions are configured, install ``pytest``. Make sure you have installed all the necessary pytest
plugins to automatically test your code successfully (e.g. ``pytest-cov``). Finally, run:

@@ -72,36 +64,41 @@ Writing a new dataset loader
#############################


-The steps to add a new dataset loader to ``mirdata`` are:
+The steps to add a new dataset loader to Mirdata are:

1. `Create an index <create_index_>`_
2. `Create a module <create_module_>`_
3. `Add tests <add_tests_>`_
-4. `Submit your loader <submit_loader_>`_
+4. `Update Mirdata documentation <update_docs_>`_
+5. `Upload index to Zenodo <upload_index_>`_
+6. `Create a Pull Request on GitHub <create_pr_>`_


Before starting, check if your dataset falls into one of these non-standard cases:

* Is the dataset not freely downloadable? If so, see `this section <not_open_>`_
* Does the dataset require dependencies not currently in mirdata? If so, see `this section <extra_dependencies_>`_
* Does the dataset have multiple versions? If so, see `this section <multiple_versions_>`_
-* Is the index large (e.g. > 5 MB)? If so, see `this section <large_index_>`_


.. _create_index:

1. Create an index
------------------

-``mirdata``'s structure relies on `indexes`. Indexes are dictionaries contain information about the structure of the
-dataset which is necessary for the loading and validating functionalities of ``mirdata``. In particular, indexes contain
+Mirdata's structure relies on `indexes`. Indexes are dictionaries that contain information about the structure of the
+dataset, which is necessary for the loading and validating functionalities of Mirdata. In particular, indexes contain
information about the files included in the dataset, their location and checksums. The necessary steps are:

1. To create an index, first create a script in ``scripts/``, e.g. ``make_dataset_index.py``, which generates an index file.
2. Then run the script on the dataset and save the index in ``mirdata/datasets/indexes/`` as ``dataset_index_<version>.json``,
   where ``<version>`` indicates which version of the dataset was used (e.g. 1.0).
3. When the dataloader is completed and the PR is accepted, upload the index to our `Zenodo community <https://zenodo.org/communities/audio-data-loaders/>`_. See more details `here <upload_index_>`_.


The script ``make_<datasetname>_index.py`` should automate the generation of the index by computing the MD5 checksums for the files of a dataset located at ``data_path``.
Users can adapt this script to create an index for their dataset by adding their file paths and using the ``md5`` function to generate checksums for their files.
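
As a minimal sketch, such a script could look like this, assuming a hypothetical flat layout of ``audio/*.wav`` files with matching ``annotations/*.csv`` files (real datasets will differ in layout and index fields):

.. code-block:: python

    import argparse
    import glob
    import json
    import os

    from mirdata.validate import md5  # computes the MD5 checksum of a file


    def make_dataset_index(data_path):
        """Build a mirdata-style index for a hypothetical flat dataset layout."""
        index = {"version": "1.0", "tracks": {}}
        for audio_path in sorted(glob.glob(os.path.join(data_path, "audio", "*.wav"))):
            track_id = os.path.splitext(os.path.basename(audio_path))[0]
            annotation_path = os.path.join(data_path, "annotations", track_id + ".csv")
            index["tracks"][track_id] = {
                # store paths relative to the dataset root, together with checksums
                "audio": (os.path.relpath(audio_path, data_path), md5(audio_path)),
                "annotation": (os.path.relpath(annotation_path, data_path), md5(annotation_path)),
            }
        return index


    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description="Make dataset index.")
        parser.add_argument("data_path", type=str, help="Path to the dataset folder.")
        args = parser.parse_args()
        with open("dataset_index_1.0.json", "w") as fhandle:
            json.dump(make_dataset_index(args.data_path), fhandle, indent=2)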

.. _index example:

Here is an example of an index to use as a guideline:
@@ -114,6 +111,9 @@

More examples of scripts used to create dataset indexes can be found in the `scripts <https://github.com/mir-dataset-loaders/mirdata/tree/master/scripts>`_ folder.

.. note::
    Users should be able to create the dataset indexes without additional dependencies that are not included in Mirdata by default. If you need an extra dependency for a specific reason, please open an issue to discuss it with the Mirdata maintainers.

tracks
^^^^^^

@@ -302,6 +302,77 @@ You may find these examples useful as references:
For many more examples, see the `datasets folder <https://github.com/mir-dataset-loaders/mirdata/tree/master/mirdata/datasets>`_.


Declare constant variables
^^^^^^^^^^^^^^^^^^^^^^^^^^
Please include the variables ``BIBTEX``, ``INDEXES``, ``REMOTES``, and ``LICENSE_INFO`` at the beginning of your module.
``BIBTEX`` (the bibtex-formatted citation of the dataset), ``INDEXES`` (index URLs, checksums, and versions),
and ``LICENSE_INFO`` (the license that covers the dataset) are mandatory; ``REMOTES`` is only defined if the dataset is openly downloadable.

``INDEXES``
    As seen in the example, there are two ways to define an index:
    by providing a URL to download the index file, or by providing the filename of the index file, assuming it is available locally (as with sample indexes).

    * The full indexes for each version of the dataset should be retrieved from our Zenodo community. See more details `here <upload_index_>`_.
    * The sample indexes should be locally stored in the ``tests/indexes/`` folder, and directly accessed through filename. See more details `here <add_tests_>`_.

**Important:** We recommend setting the highest version of the dataset as the default version in the ``INDEXES`` variable.
However, if there is a good reason to use a different version as the default, feel free to do so.

When defining a remote index in ``INDEXES``, simply also pass the arguments ``url`` and ``checksum`` to the ``Index`` class:

.. code-block:: python

    "1.0": core.Index(
        filename="example_index_1.0.json",  # the name of the index file
        url=<url>,  # the download link
        checksum=<checksum>,  # the md5 checksum
    )

Remote indexes get downloaded along with the data when calling ``.download()``, and are stored in ``<data_home>/mirdata/datasets/indexes``.

``REMOTES``
    Should be a dictionary of ``RemoteFileMetadata`` objects, which are used to download the dataset files. See an example below:

.. code-block:: python

    REMOTES = {
        "annotations": download_utils.RemoteFileMetadata(
            filename="The Beatles Annotations.tar.gz",
            url="http://isophonics.net/files/annotations/The%20Beatles%20Annotations.tar.gz",
            checksum="62425c552d37c6bb655a78e4603828cc",
            destination_dir="annotations",
        ),
    }

Add more ``RemoteFileMetadata`` objects to the ``REMOTES`` dictionary if the dataset is split into multiple files.
Please use ``download_utils.RemoteFileMetadata`` to fetch the dataset from an online repository; it takes care of the download process and the checksum validation, and addresses corner cases.
Please do NOT use specific functions like ``download_zip_file`` or ``download_and_extract`` individually in your loader.

.. note::
    The direct download URL and checksum can be found in the Zenodo entries of the dataset and the index. Bear in mind that the URL and checksum for the index will only be available once a maintainer of the Audio Data Loaders Zenodo community has accepted the index upload.
    For other repositories, you may need to generate the checksum yourself.
    You may use the ``md5`` function provided in ``mirdata/validate.py``.
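
For instance, a checksum can be computed locally with the ``md5`` utility (a minimal sketch; the index path is a hypothetical placeholder):

.. code-block:: python

    from mirdata.validate import md5

    # compute the MD5 checksum of a locally stored index file
    print(md5("mirdata/datasets/indexes/example_index_1.0.json"))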


Make sure the docstring of the dataloader covers the following relevant aspects of the dataset you are integrating:

* The dataset name.
* A general purpose description, the task it is used for.
* Details about the coverage: how many clips, how many hours of audio, how many classes, the annotations available, etc.
* The license of the dataset (even if you have included the ``LICENSE_INFO`` variable already).
* The authors of the dataset, the organization in which it was created, and the year of creation (even if you have included the ``BIBTEX`` variable already).
* Also reference any relevant links or websites that users can check for more information.
.. note::
    In addition to the module docstring, you should write docstrings for every new class and function you write. See :ref:`the documentation tutorial <documentation_tutorial>` for practical information on best documentation practices.
    The module docstring is important for users to understand the dataset and its purpose, and proper documentation also enhances transparency.
    Please do not include complicated tables, large blocks of text, or unformatted copy-pasted text. Keep the docstring clean and the information clear to users; this will also encourage users to use the dataloader!

For many more examples, see the `datasets folder <https://github.com/mir-dataset-loaders/mirdata/tree/master/mirdata/datasets>`_.

.. note::
    If the dataset you are trying to integrate stores every clip in a separate compressed file, it is not currently supported by Mirdata. Feel free to open an issue to discuss a solution (hopefully for the near future!)


.. _add_tests:

3. Add tests
@@ -399,9 +470,7 @@ kindly ask the contributors to **reduce the size of the testing data** if possib
csv files).


-.. _submit_loader:
-
-4. Submit your loader
----------------------
+4. Update Mirdata documentation
+-------------------------------

Before you submit your loader, make sure to:
@@ -433,16 +502,50 @@ An example of this for the ``Beatport EDM key`` dataset:
(you can check that this was done correctly by clicking on the readthedocs check when you open a PR). You can find license
badge images and links `here <https://gist.github.com/lukas-h/2a5d00690736b4c3a7ba>`_.

-Pull Request template
-^^^^^^^^^^^^^^^^^^^^^
-
-When starting your PR please use the `new_loader.md template <https://github.com/mir-dataset-loaders/mirdata/blob/master/.github/PULL_REQUEST_TEMPLATE/new_loader.md>`_,
.. _upload_index:

5. Uploading the index to Zenodo
--------------------------------

We store all dataset indexes in an online repository on Zenodo.
To use a dataloader, users may retrieve the index by running the ``dataset.download()`` function that is also used to download the dataset.
To download only the index, you may run ``dataset.download(["index"])``. The index will be automatically downloaded and stored in the expected folder in Mirdata.
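
For example, a user could fetch just the index of a loader like this (a minimal sketch; the ``beatles`` dataset id and the ``data_home`` path are illustrative placeholders):

.. code-block:: python

    import mirdata

    dataset = mirdata.initialize("beatles", data_home="/path/to/mir_datasets/beatles")
    dataset.download(["index"])  # download only the index, not the audio or annotations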

From a contributor's point of view, you may create the index, store it locally, and develop the dataloader.
All JSON files in ``mirdata/datasets/indexes/`` are listed in the ``.gitignore`` file,
so there is no need to remove them when pushing to the remote branch during development; they will be ignored by git.

**Important!** When creating the PR, please `submit your index to our Zenodo community <https://zenodo.org/communities/audio-data-loaders/>`_:

* First, click on ``New upload``.
* Add your index in the ``Upload files`` section.
* Let Zenodo create a DOI for your index: when asked whether you already have a DOI, click *No*.
* Resource type is *Other*.
* Title should be *mirdata-<dataset-id>_index_<version>*, e.g. mirdata-beatles_index_1.2.
* Add yourself as the Creator of this entry.
* The license of the index should be the `same as Mirdata <https://github.com/mir-dataset-loaders/mirdata/blob/master/LICENSE>`_.
* Visibility should be set as *Public*.

.. note::
    *<dataset-id>* is the identifier we use to initialize the dataset using ``mirdata.initialize()``. It is also the filename of your dataset module.


.. _create_pr:

6. Create a Pull Request
------------------------

Please create a Pull Request with all your development. When starting your PR, please use the `new_loader.md template <https://github.com/mir-dataset-loaders/mirdata/blob/master/.github/PULL_REQUEST_TEMPLATE/new_loader.md>`_;
it will simplify the reviewing process and also help you make a complete PR. You can do that by adding
``&template=new_loader.md`` at the end of the url when you are creating the PR:
``...mir-dataset-loaders/mirdata/compare?expand=1`` will become
``...mir-dataset-loaders/mirdata/compare?expand=1&template=new_loader.md``.

.. _update_docs:


Docs
^^^^

@@ -584,30 +687,6 @@ could look like:
}


-.. _large_index:
-
-Datasets with large indexes
----------------------------
-
-Large indexes should be stored remotely, rather than checked in to the mirdata repository.
-mirdata has a `zenodo community <https://zenodo.org/communities/mirdata/?page=1&size=20>`_
-where larger indexes can be uploaded as "datasets".
-
-When defining a remote index in ``INDEXES``, simply also pass the arguments ``url`` and
-``checksum`` to the ``Index`` class:
-
-.. code-block:: python
-
-    "1.0": core.Index(
-        filename="example_index_1.0.json",  # the name of the index file
-        url=<url>,  # the download link
-        checksum=<checksum>,  # the md5 checksum
-    )
-
-Remote indexes get downloaded along with the data when calling ``.download()``,
-and are stored in ``<data_home>/mirdata_indexes``.


Documentation
#############

28 changes: 17 additions & 11 deletions docs/source/contributing_examples/example.py
@@ -43,18 +43,24 @@
"""

# -- INDEXES specifies different versions of a dataset
-# -- "default" and "test" specify which key should be used
-# -- by default, and when running tests.
-# -- Some datasets have a "sample" version, which is a mini-version
-# -- that makes it easier to try out a large dataset without needing
-# -- to download the whole thing.
-# -- If there is no sample version, simply set "test": "1.0".
-# -- If the default data is remote, there must be a local sample for tests!
+# -- "default" and "test" specify which key should be used by default and when running tests
+# -- Each index is defined by {"version": core.Index instance}
+# -- | filename: index name
+# -- | url: Zenodo direct download link of the index (will be available after the index upload is
+# --   accepted to the Audio Data Loaders Zenodo community).
+# -- | checksum: checksum of the index hosted at Zenodo.
+# -- The direct download url and checksum can be found in the Zenodo entry of the dataset.
+# -- A sample index is a mini-version that makes it easier to test large datasets.
+# -- There must be a local sample index for testing for each remote index.
INDEXES = {
-    "default": "1.0",
+    "default": "1.2",
    "test": "sample",
-    "1.0": core.Index(filename="example_index_1.0.json"),
-    "sample": core.Index(filename="example_index_sample.json")
+    "1.2": core.Index(
+        filename="beatles_index_1.2.json",
+        url="https://zenodo.org/records/14007830/files/beatles_index_1.2.json?download=1",
+        checksum="6e1276bdab6de05446ddbbc75e6f6cbe",
+    ),
+    "sample": core.Index(filename="beatles_index_1.2_sample.json"),
}

# -- REMOTES is a dictionary containing all files that need to be downloaded.
@@ -248,7 +254,7 @@ def to_jams(self):
        return jams_utils.jams_converter(
            audio_path=self.mix_path,
            annotation_data=[(self.annotation, None)],
-            ...
+            # ...
        )
        # -- see the documentation for ``jams_utils.jams_converter`` for all fields

Expand Down