TaxDict breaks if passed lineage with unlabeled terminal rank #4

Open
tanaes opened this issue Feb 2, 2016 · 3 comments

Collaborator

tanaes commented Feb 2, 2016

I noticed this when parsing a list of insect taxa from GenBank with available genomic information. When I attempted to make a TaxDict object from the list of taxa, it failed.

from taxon_names_resolver import Resolver, TaxDict

host = ['Unclassified Trichoceridae']

resolved_host = Resolver(terms=host)
resolved_host.main()

taxonomy = ['subspecies', 'species', 'genus',
            'family', 'order', 'class', 'phylum', 'kingdom']

idents = resolved_host.retrieve('query_name')

lineages = resolved_host.retrieve('classification_path')

ranks = resolved_host.retrieve('classification_path_ranks')

print(list(zip(ranks[0], lineages[0])))
[('superkingdom', 'Eukaryota'), ('', 'Opisthokonta'), ('kingdom', 'Metazoa'), ('', 'Eumetazoa'), ('', 'Bilateria'), ('', 'Protostomia'), ('', 'Ecdysozoa'), ('', 'Panarthropoda'), ('phylum', 'Arthropoda'), ('', 'Mandibulata'), ('', 'Pancrustacea'), ('superclass', 'Hexapoda'), ('class', 'Insecta'), ('', 'Dicondylia'), ('', 'Pterygota'), ('subclass', 'Neoptera'), ('infraclass', 'Endopterygota'), ('order', 'Diptera'), ('suborder', 'Nematocera'), ('infraorder', 'Psychodomorpha'), ('superfamily', 'Trichoceroidea'), ('family', 'Trichoceridae'), ('', 'Unclassified')]

In this case, the terminal lineage entry is 'Unclassified' with no assigned rank, causing _getLevel to fail on initialization of the TaxRef object:

taxdict = TaxDict(idents=idents, ranks=ranks, lineages=lineages,
                  taxonomy=taxonomy)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-0af07c3e75e4> in <module>()
      1 taxdict = TaxDict(idents=idents, ranks=ranks, lineages=lineages,
----> 2                   taxonomy=taxonomy)

/Users/jonsanders/Development/git_sw/TaxonNamesResolver/taxon_names_resolver/manip_tools.py in __init__(self, idents, ranks, lineages, taxonomy, **kwargs)
    115             # create taxref
    116             taxref = TaxRef(ident=idents[i], rank=ranks[i][-1],
--> 117                             taxonomy=self.taxonomy)
    118             # create key for ident and insert a dictionary of:
    119             #  lineage, taxref, cident, ident and rank

/Users/jonsanders/Development/git_sw/TaxonNamesResolver/taxon_names_resolver/manip_tools.py in __init__(self, ident, rank, taxonomy)
     34         except ValueError as e:
     35             print('Error in taxon ident: {}'.format(ident))
---> 36             raise e
     37         super(TaxRef, self).__setattr__('counter', 0)  # count ident changes
     38 

/Users/jonsanders/Development/git_sw/TaxonNamesResolver/taxon_names_resolver/manip_tools.py in __init__(self, ident, rank, taxonomy)
     31         try:
     32             super(TaxRef, self).__setattr__('level',
---> 33                                         self._getLevel(rank, taxonomy))
     34         except ValueError as e:
     35             print('Error in taxon ident: {}'.format(ident))

/Users/jonsanders/Development/git_sw/TaxonNamesResolver/taxon_names_resolver/manip_tools.py in _getLevel(self, rank, taxonomy)
     54             return taxonomy.index(rank)
     55         # else find its closest by using the default taxonomy
---> 56         dlevel = default_taxonomy.index(rank)
     57         i = 1
     58         d = dlevel + i

ValueError: '' is not in list

Not sure what the best way to resolve this should be.

  1. Could add a catch in _getLevel to make sure the query rank is present in the default taxonomy before it tries to index, otherwise return 'Unknown' or similar.

  2. Rather than simply passing the terminal rank to the TaxRef constructor, look for the most terminal labeled rank that is present in either the provided or default taxonomy.
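
Option 1 could look roughly like the sketch below. This is not the library's `_getLevel`: `get_level` and the shortened `DEFAULT_TAXONOMY` list are hypothetical stand-ins, and both the `None` return value and the fallback to the next higher rank are assumptions made for illustration.

```python
# Hypothetical, shortened stand-in for the package's default taxonomy
# (ordered from lowest to highest rank).
DEFAULT_TAXONOMY = ['subspecies', 'species', 'genus', 'family', 'superfamily',
                    'order', 'class', 'phylum', 'kingdom', 'superkingdom']


def get_level(rank, taxonomy, default_taxonomy=DEFAULT_TAXONOMY):
    """Return the index of rank in taxonomy, or None if the rank is unknown."""
    if rank in taxonomy:
        return taxonomy.index(rank)
    # Guard: an unlabeled rank ('') is in neither list, so never call
    # .index() on it.
    if rank not in default_taxonomy:
        return None
    # The rank is only in the default list: fall back to the nearest
    # higher rank that the caller's taxonomy also contains.
    for higher in default_taxonomy[default_taxonomy.index(rank) + 1:]:
        if higher in taxonomy:
            return taxonomy.index(higher)
    return None


taxonomy = ['subspecies', 'species', 'genus',
            'family', 'order', 'class', 'phylum', 'kingdom']

print(get_level('family', taxonomy))       # 3
print(get_level('superfamily', taxonomy))  # 4, the index of 'order'
print(get_level('', taxonomy))             # None instead of a ValueError
```

The caller would still have to decide what to do with a `None` level, which is part of why this feels like it could hide other problems.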

Any thoughts?

@tanaes tanaes self-assigned this Feb 2, 2016
Collaborator Author

tanaes commented Feb 3, 2016

Went with option (2) in commit e14cec0

I think this more explicitly addresses the failure mode while being less likely to cover up other potential problems...

@tanaes tanaes added the bug label Feb 3, 2016
@DomBennett
Owner

Hi Jon,

I tried out your option (2) solution and I think it's a good one. By searching for the lowest shared rank between the given taxonomy and the resolved taxonomy, you're identifying the best match in a repeatable way. In the case of your unidentified insect family, it's returning the lowest identifiable level, Trichoceridae, which is a family. In the hypothetical situation where the last 2 ranks are unknown, your method would be able to handle that too. It should also raise an error if a resolved name contains only unidentified ranks.
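
As a rough illustration of the idea (not the code in commit e14cec0), the search can walk the resolved ranks from the terminal end and stop at the first one found in the taxonomy. `lowest_shared_rank` is a hypothetical helper, and the rank and lineage lists are abbreviated from the example in this issue.

```python
def lowest_shared_rank(ranks, lineage, taxonomy):
    """Return (rank, name) for the most terminal rank that is in taxonomy."""
    # Walk from the terminal (most specific) entry back towards the root.
    for rank, name in zip(reversed(ranks), reversed(lineage)):
        if rank in taxonomy:
            return rank, name
    raise ValueError('lineage contains no ranks present in taxonomy')


taxonomy = ['subspecies', 'species', 'genus',
            'family', 'order', 'class', 'phylum', 'kingdom']

# Abbreviated tail of the resolved classification for the Trichoceridae query.
ranks = ['class', 'order', 'suborder', 'superfamily', 'family', '']
lineage = ['Insecta', 'Diptera', 'Nematocera', 'Trichoceroidea',
           'Trichoceridae', 'Unclassified']

print(lowest_shared_rank(ranks, lineage, taxonomy))
# ('family', 'Trichoceridae')

# Two unlabeled terminal entries are handled the same way.
print(lowest_shared_rank(ranks + [''], lineage + ['Unknown'], taxonomy))
# ('family', 'Trichoceridae')
```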

I would say, however, that it may not be a good idea to search for the low_rank in both the given and default taxonomy. In certain cases you may not want the level of resolution provided by the default. Perhaps you could therefore improve your code by raising an error saying something like "i contains no ranks present in taxonomy".
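
To make the difference concrete, here is a small sketch that compares searching the given taxonomy only with searching the given plus default taxonomy, and raises the kind of error described above when nothing matches. Everything in it is made up for illustration: the helper `lowest_rank`, the `use_default` flag, the shortened `DEFAULT_TAXONOMY` list and the ident names.

```python
# Illustrative only: a shortened stand-in for the package's default taxonomy.
DEFAULT_TAXONOMY = ['species', 'genus', 'family', 'superfamily', 'order', 'class']


def lowest_rank(ident, ranks, taxonomy, use_default=False):
    """Return the most terminal rank of ident that is in the allowed ranks."""
    allowed = set(taxonomy)
    if use_default:
        allowed |= set(DEFAULT_TAXONOMY)
    # Walk from the terminal rank back towards the root.
    for rank in reversed(ranks):
        if rank in allowed:
            return rank
    raise ValueError('{} contains no ranks present in taxonomy'.format(ident))


taxonomy = ['species', 'genus', 'family', 'order', 'class']
ranks = ['class', 'order', 'superfamily', '']  # terminal rank is unlabeled

print(lowest_rank('taxon A', ranks, taxonomy))
# order
print(lowest_rank('taxon A', ranks, taxonomy, use_default=True))
# superfamily

try:
    lowest_rank('taxon B', ['', ''], taxonomy)
except ValueError as err:
    print(err)
# taxon B contains no ranks present in taxonomy
```

With the given taxonomy only, the result stays within the ranks the user asked for, even though the default list could resolve one step lower.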

Feel free to push your solution to the master and I can play with it a bit more.

Collaborator Author

tanaes commented Feb 3, 2016

I was wondering how that would interface with your intentions for the dlevel process in _getLevel. I guess the preference would be to default to the higher-level rank... I'll put in an exception catch and merge that code.
