This guide explains how to create, update and test Nextclade datasets as well as how to contribute them into the official Nextclade dataset collection.
⚠️ If you are a user of Nextclade Web or Nextclade CLI and looking for documentation on how to use Nextclade, see Nextclade user documentation instead.
⚠️ If you are looking for Nextclade software developer documentation, see Nextclade developer guide instead.
⚠️ This guide serves for advanced Nextclade users and enthusiasts who want to create and maintain their own Nextclade datasets, e.g. to add a yet unsupported pathogen or strain. It assumes basic familiarity with Nextclade CLI and Nextclade Web and some experience with different datasets as a user. If you are not yet comfortable using Nextclade and want to learn more about Nextclade datasets, please refer to the Nextclade user documentation first.
Nextclade software is built to be agnostic to pathogens it analyzes. Instead, the information about particularities of certain pathogens is provided in the form of so-called Nextclade datasets. A Nextclade dataset is a set of predefined files (in a directory or in a zip archive) which adds support for a particular pathogen or a strain to Nextclade CLI and Nextclade Web.
In this repository:
-
The
data/
directory contains source data for datasets. It contains datasets grouped into dataset collections. For technical purposes, a dataset is defined as a leaf directory containing apathogen.json
file. It can be a part of the arbitrarily nested directory tree. -
The rebuild script (
./scripts/rebuild
) processes all dataset sources from thedata/
directory and producesdata_output/
directory. -
The
data_output/
directory is the current snapshot of the dataset server state. It contains versioned datasets as well as index files containing metadata which is necessary for Nextclade software to operate.⚠️ Never modifydata_output/
directory! All manual changes will be overwritten by the automation. -
The contents of
data_output/
directory is deployed to the production dataset servers (see Dataset server maintenance guide), which makes it available to all Nextclade Web and Nextclade CLI users. The GitHub URL to this directory can also be used as a temporary substitute for a dataset server (in this case GitHub website acts as a server for us).
If you already have a dataset for Nextclade v2 and want to upgrade it to Nextclade v3, see instructions in the Dataset migration guide.
We use GitHug pull requests to manage contributions to Nextclade datasets.
In order to add or modify datasets you will need to have a local copy of the GitHub repository nextstrain/nextclade_data
on your computer, make the desired changes, commit & push the changes to a new git branch, and submit a pull request to nextstrain/nextclade_data
. The pull request will be reviewed by Nextclade maintainers and considered for inclusion to the Nextclade dataset collection.
Make sure you have git installed, have an account on GitHub and can pull and push code from and to GitHub repositories.
Make a fork of the nextstrain/nextclade_data upstream repository and clone your forked repository:
git clone [email protected]:<your_github_username>/nextclade_data
Add changes to your forked repository, commit and submit a pull request to the upstream repository nextstrain/nextclade_data
. When submitting, tick the checkbox "Allow edits from maintainers", so that nextstrain-bot
and project maintainers can contribute back to your pull request branch.
Refer to GitHub documentation "Contributing to projects" for more details.
💡 Make sure you keep your local code up to date with the origin repo, especially if it's forked.
💡 If you are a member of Nextstrain team, then you don't need a fork, and you can contribute directly to the origin repository
nextstrain/nextclade_data
. Nonetheless, please still submit a pull requests for review, rather than pushing changes to branches directly.
This section describes a sequence of steps to add a new Nextclade dataset for a pathogen or a strain. It assumes that you have a local copy of the nextstrain/nextclade_data repo available.
-
Optionally, create a new git branch. Give it a name that briefly explains planned additions. Refer to documentation of git and of GitHub for more details.
-
Add a directory under
data/community/<dataset_path>
.⚠️ We will discuss some important considerations about dataset paths in the Dataset paths section later on. If you want to submit your dataset to the community dataset collection, please follow these recommendations carefully.💡 If you are a member of Nextstrain organization, then submit to the "nextstrain" collection (
data/nextstrain
), rather than to "community" (data/community
). -
Add
pathogen.json
file. Usepathogen.json
files of the existing datasets as a template and modify them as needed. -
Add a
CHANGELOG.md
file containing second-level heading## Unreleased
(spelled exactly like this) and free-form text under it, describing proposed changes. You can use Markdown syntax. In this case you can write that it's the first release of this dataset.⚠️ It is important to name the section exactly: two hashes, space and the word "Unreleased", starting with the capital letter "U". This text will be used to automatically find and extract release notes, which are then published along with the next dataset release. -
Add remaining dataset files. At a very minimum, you should have required files:
reference.fasta
,pathogen.json
andCHANGELOG.md
. See dataset creation guide for a detailed walkthrough. -
Optionally, test your dataset locally
-
Commit and push your changes to your forked repository on GitHub. Refer to documentation of git and of GitHub for more details.
-
Submit your changes as a pull request to the
nextstrain/nextclade_data
repository on GitHub.⚠️ Note thatnextstrain-bot
will automatically run the rebuild script, and will commit changes to thedata/
anddata_output/
directories on your git branch. Don't forget to pull these changes if you are going to make more commits. -
Optionally, test your dataset from GitHub
-
Wait for maintainers and community members to review and either accept or reject your proposal. Be ready to discuss the proposed changes, and to apply modifications if requested.
-
Find your dataset in the
data/
directory and modify dataset files as you see fit. -
On the very top of the
CHANGELOG.md
, add a second-level heading## Unreleased
(spelled exactly like this) and a paragraph under it describing proposed changes in free form. You can use the usual Markdown syntax. Keep records for previous releases in place. If there's already an## Unreleased
section (meaning this dataset already has changes that are yet to be released) append the summary of your changes to the existing summary. -
Submit the result for consideration as a pull request (similarly to how it was done in the "Adding a new dataset" section)
-
Wait for maintainers and community members to review and either accept or reject your proposal. Be ready to discuss the proposed changes, and to apply some modifications if requested.
⚠️ Once released, datasets cannot be removed!
⚠️ Note that because the GitHub repository is public, we cannot control any copies of your data that people could be making before it is removed from our repository. Never commit and push any sensitive or copyrighted data. Remember: once something is on the internet, it stays there forever!
⚠️ Before submitting something, make sure you have correct legal rights and/or author's permission to do so. We recommend to acknowledge authors of all materials and require to do so if the materials require acknowledgment.
If your pull request got merged, the changes weren't yet released, and you want to remove them, then open a pull request which rolls back your previously submitted changes as soon as possible.
If your pull request is not merged yet, simply close the pull request. Explain your decision in a comment message.
If you want to signal users that a dataset is no longer maintained, is inaccurate or obsolete, rather than deleting it you can set the field "deprecated": true
in pathogen.json
file. In this case the dataset will be listed in Nextclade with a "deprecated" badge and at the bottom of the list. Please explain the reason for deprecation in the changelog section (as described in the usual update steps), and add some details to the readme file if there is one, so that users could make an informed decision themselves whether to use it or not.
If your pull request is accepted and merged, your data enters master
branch and is automatically deployed to the master
environment. In a few minutes after merge it should be visible at https://master.clades.nextstrain.org
Nextclade server maintainers, after ensuring that there is no obvious bugs or crashes, periodically trigger deployment of the changes accumulated on master
to the release
environment, and then it becomes visible at the main site at https://clades.nextstrain.org.
There is no set schedule for releases, so let us know in the pull request if you have any particular preference.
The path of a given dataset directory in this repository is used to uniquely indentify this dataset, and has important consequences for its usage in Nextclade software. In this section we explain the best practices for Nextclade dataset paths.
If you want to submit your dataset for consideration for the inclusion to the Nextclade community datasets, then we ask that you follow certain conventions for the path segments when placing files and creating your pull request.
We currently recommend the following directory structure for your dataset directories:
data/community/your-org/your-name/pathogen-name/strain-name/other/features/
Let's split this path into segments and describe meaning of each segment:
data/
: this is the root of the dataset collection storagecommunity/
: this is the root of the "community" dataset collectionyour-org/your-name/
: replace these path segments with your organization name, if you are submitting on behalf of an organization, as well as with your GitHub nickname. This way every organization (and organization member) has its own directory. Feel free to create some nested structure relevant for your organization - you can nest the subdirectories arbitrarily and this is not limited to 2 levels. We only ask to not submit datasets directly into thecommunity/
, to avoid clashes between datasets from different authors and organizations.pathogen-name/strain-name/
: in these segments feel free to use the name of the pathogen and potentially strain name and/or an accession of the reference sequence. Again, you can nest the subdirectories arbitrarily and this is not limited to 2 levels.other/features/
: if you need some more levels of path segments to describe a particular dataset, for example a particular geographic location, time period or a host organism, then you can create additional path segments for it.
Note, that this structure is only relevant if you want to submit your datasets into this repository. You can of course use any dataset directory structure on your local computer or in your own repositories.
-
Dataset paths are used as dataset identifiers, for example in an argument of Nextclade CLI invocation and in URL parameters of Nextclade WEb, so they should not be excessively long.
-
These paths are used in directory names and URLs, so please avoid using spaces and special characters. Prefer lowercase letters and dashes (
-
) over underscores (_
) where possible. -
Meaningful, readable names and directory structure is encouraged. At the time of submission of the pull request, please avoid temporary names or names that cannot be understood by the potential users of the dataset.
-
Be consistent in your naming conventions to avoid confusion. For example, choose between "flu" and "influenza", stick to it, and avoid mixing both names. You can explain your naming and other choices in the
README.md
file, which will be visible for all users. -
In order to allow reproducibility of Nextclade analysis results, released datasets are immutable. Once a dataset is submitted and released, it cannot be revoked and the path cannot be changed! You can of course later submit the same dataset under a new path, but this will lose continuity of versioning - your potential users won't know about this new dataset and will not be able to receive updates, because they will still be using the old path (for example hardcoded in their analysis pipeline's code). So design your dataset path hierarchies carefully and with consideration for further updates and potential future additions, such that the paths are final and won't need to be modified.
Take a look at the "nextstrain" dataset collection for concrete examples of how maintainers of official dataset collection like to organize their datasets.
The guide in Test datasets locally describes how to change the datasets server URL used by Nextclade and Nextclade CLI as well as how to run a local dataset server. This is useful for testing the datasets modifications locally.
Once the pull request is submitted, a GitHub Actions instance will be started. It will run ./scripts/rebuild
and commit and push the produced build to the data_output/
directory on behalf of nextstrain-bot
GitHub user. You will see the new commits appearing in your pull request. Once this is in place, the link to the data_output/
GitHub directory can be used in Nextclade Web and Nextclade CLI as an alternative dataset server:
- First wait the GitHub Action runs in the "checks" section in your pull request to complete successfully.
- Wait for
nextstrain-bot
commit to appear in the list of commits of the pull request (refresh the page if needed). - Obtain a full URL to the
data_output/
directory on your pull request's branch. For example, you can select your branch on GitHub, navigate todata_output/
in the directory tree and copy the resulting URL in the address bar of your browser. - In Nextclade CLI, use
--server=<github_url>
argument forrun
anddataset get
commands. - In Nextclade Web, add
dataset-server=<github_url>
URL parameter.
This will tell Nextclade to fetch your modified datasets right from GitHub (GitHub acts a dataset "server" here).
If you don't want to submit your datasets to this repository, or prefer to deploy your datasets to your own GitHub repository or to your own web server, then follow Dataset server maintenance guide for how to set it up.
For all questions, ideas, bug reports relevant to datasets, please open a GitHub issue in this repository or join our discussion forum.