EB-140: Investigate if alias files can be used to recode more meaningful scaffold names #74

Draft
wants to merge 1 commit into
base: main

brinkdp (Collaborator) commented Nov 28, 2024

Recently, we discussed that the scaffold names displayed in the dropdown in the JBrowse session are unnecessarily complex, as they consist of just the sequence accession numbers. It would be desirable to change these into something more self-explanatory or, at least, into the scaffold names used internally by the research groups. For instance, chr1 is often used as shorthand for chromosome 1 in assemblies that are considered to have chromosome-level completeness.

This PR investigates whether alias files can be used to recode more meaningful scaffold names, using a feature introduced in JBrowse v2.14.0. The new feature is a NcbiSequenceReportAliasAdapter, designed to allow an NCBI sequence_report.tsv to be used as a refNameAlias and, from there, to recode the displayed names of the scaffolds in the JBrowse session.

The following experiment tests the feature in a reproducible manner, to evaluate whether it is something we are interested in pursuing. This test implementation is not fully compatible with the current logic of the makefile. Instead, it uses jq to adjust the final config.json a posteriori, specifying that NcbiSequenceReportAliasAdapter should be used instead of RefNameAliasAdapter. Specifying this in the initial config.json did not work, as it was overwritten when make runs the JBrowse CLI. The sequence_report.tsv used in this example is a modified version of the file used in the JBrowse PR, populated with the data from the L. tenue assembly. ABCDE is a placeholder to check that the column was not used for the aliasing.

Commands:

SWG_TAG=local ./scripts/dockermake --test SPECIES=recode_names
jq '.assemblies[].refNameAliases.adapter.type = "NcbiSequenceReportAliasAdapter"' tests/data/recode_names/config.json > tests/data/recode_names/tmp.$$.json && mv tests/data/recode_names/tmp.$$.json tests/data/recode_names/config.json
cp tests/fixtures/recode_names/sequence_report.tsv tests/data/recode_names
docker stop jb2-recode_names; SWG_TAG=local ./scripts/browse tests/data/recode_names

Result:
[Screenshot 2024-11-28 at 15:36:25]

The short-form L. tenue scaffold names now display as desired! However, there are some things to consider from this implementation:

  • For this example, the aliasing works for the Figshare-hosted files (i.e. the features are rendered as intended, see figure above). This is because both the GFF and the BED refer to the short names of the scaffolds. Honestly, I think I was lucky in this case. My current understanding is that NcbiSequenceReportAliasAdapter only supports aliasing two scaffold name synonyms, which is reasonable given the scope of the original PR. In this case, there were only two synonyms: the ENA-formatted fasta header and the "Figshare"-formatted fasta header. But we have previously assumed that we might at times need three synonyms, if we need to display an ENA assembly, an NCBI GFF, and a research group track that each use different header formatting.
  • The ordering of the recoded scaffold names is still based on the original scaffold names. It would have been great to e.g. place LG10 after LG9, and the chloroplast (CHL) between the primary and mitochondrial scaffolds. Changing the row order in sequence_report.tsv to reflect this is ignored in the final session, indicating that the original names take precedence for sorting.
  • In contrast to the previous bullet, the defaultSession calls do need to be made to the recoded name and not to the original scaffold name. For instance: '.defaultSession.views[0].displayedRegions[0].refName' was changed from ENA|CAMGYJ010000002|CAMGYJ010000002.1 to LG1 to achieve a working defaultSession.
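
As an illustration of the last bullet, here is a minimal Python sketch of how the defaultSession regions could be pointed at the recoded names. It assumes the config.json layout implied by the jq path above (defaultSession, then views, then displayedRegions, each with a refName). The function name rename_displayed_regions and the recoded mapping are hypothetical and not part of this repository or of JBrowse.

```python
import json


def rename_displayed_regions(config: dict, recoded: dict) -> int:
    """Swap original refNames for recoded ones in defaultSession.

    Returns the number of displayedRegions entries that were changed.
    Regions whose refName is not in `recoded` are left untouched.
    """
    changed = 0
    views = config.get("defaultSession", {}).get("views", [])
    for view in views:
        for region in view.get("displayedRegions", []):
            new_name = recoded.get(region.get("refName"))
            if new_name is not None:
                region["refName"] = new_name
                changed += 1
    return changed


# Hypothetical, trimmed-down config using the names from the bullet above.
example_config = {
    "defaultSession": {
        "views": [
            {
                "displayedRegions": [
                    {"refName": "ENA|CAMGYJ010000002|CAMGYJ010000002.1"}
                ]
            }
        ]
    }
}
recoded_names = {"ENA|CAMGYJ010000002|CAMGYJ010000002.1": "LG1"}

n_changed = rename_displayed_regions(example_config, recoded_names)
print(n_changed, json.dumps(example_config))
```

Unlike the single-index jq path, this sketch walks every view and region, which is an assumption about what would be wanted if a session has more than one displayed region.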

In all, my impression at the time of writing is that this gives the desired results. The downsides are that it might limit the aliasing to two synonyms, and that we would need to implement a non-jq way to pass '.assemblies[].refNameAliases.adapter.type = "NcbiSequenceReportAliasAdapter"' to config.json.
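
One possible non-jq route, sketched in Python using only the standard library. This is a sketch under assumptions and not how the makefile currently works: the function names set_alias_adapter_type and patch_config_file are hypothetical, and the only config.json structure assumed is the path used in the jq expression above.

```python
import json
from pathlib import Path


def set_alias_adapter_type(config: dict, adapter_type: str) -> dict:
    """Set refNameAliases.adapter.type on every assembly that has one.

    Note: unlike the jq assignment, assemblies without a
    refNameAliases.adapter entry are skipped rather than given a new one.
    """
    for assembly in config.get("assemblies", []):
        adapter = assembly.get("refNameAliases", {}).get("adapter")
        if adapter is not None:
            adapter["type"] = adapter_type
    return config


def patch_config_file(path, adapter_type="NcbiSequenceReportAliasAdapter"):
    """Rewrite config.json in place via a temporary file next to it."""
    path = Path(path)
    config = json.loads(path.read_text(encoding="utf-8"))
    set_alias_adapter_type(config, adapter_type)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    tmp.replace(path)  # same write-then-move pattern as the jq command


# In-memory demonstration with a hypothetical two-assembly config.
demo = {
    "assemblies": [
        {"name": "a", "refNameAliases": {"adapter": {"type": "RefNameAliasAdapter"}}},
        {"name": "b"},
    ]
}
set_alias_adapter_type(demo, "NcbiSequenceReportAliasAdapter")
print(json.dumps(demo, indent=2))
```

Calling patch_config_file("tests/data/recode_names/config.json") after the build step would stand in for the jq line in the commands above, assuming it runs after make has finished writing config.json.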

I would also be happy to post these questions to the JBrowse developers to see what they think. Perhaps a new adapter type would need to be developed to support more synonyms and custom ordering, who knows.

What do you think @apfuentes? Is the result like you envisioned?
What are your thoughts, @kwentine?

Experiment synopsis: run dockermake --test, change .assemblies[].refNameAliases.adapter.type from RefNameAliasAdapter to NcbiSequenceReportAliasAdapter, copy sequence_report.tsv to tests/data/recode_names, run the browse script

kwentine (Collaborator) commented Jan 7, 2025

Is this experiment still relevant after #78 made alias file generation part of the data build process? I would be happy to discuss it to understand the use case better.

And regarding the multiple synonym issue: for regular alias files, I tested that the same refname can appear on multiple lines:

chr1 foo
chr1 bar

will alias chr1 as both foo and bar in the JBrowse interface. Maybe a similar trick would work in your case?
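
To make the repeated-refname idea concrete, here is a small Python sketch that expands a table of synonyms into one alias line per (refname, alias) pair, in the shape of the example above. The helper build_alias_lines is hypothetical, and the tab separator is an assumption about the regular alias file format that should be checked against what the data build actually writes.

```python
def build_alias_lines(synonyms: dict, sep: str = "\t") -> list:
    """Emit one 'refname<sep>alias' line per alias.

    A refname with several aliases simply appears on several lines.
    Aliases identical to the refname itself are skipped.
    """
    lines = []
    for refname, aliases in synonyms.items():
        for alias in aliases:
            if alias != refname:
                lines.append(f"{refname}{sep}{alias}")
    return lines


# The two-alias example from the comment above.
alias_lines = build_alias_lines({"chr1": ["foo", "bar"]})
print("\n".join(alias_lines))
```

With three synonyms per scaffold, the same function would emit two lines per refname, which is the case the PR description was worried about.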
