gene id search is slow in large genome #2803
-
Hi, I set up a JBrowse 2 instance with a few genomes. For the large maize genome (~2 Gb), it takes 1-2 minutes to search for a gene ID, while searching for genes in the other, smaller genomes is almost instant. I have tried putting the whole jbrowse2 folder on an SSD, and also making a small fake GFF file with only the gene ID lines and text-indexing that instead of the full GFF with all the isoform, intron, and exon records, but it is still too slow. Any suggestions? Best http://www.epigenome.cuhk.edu.hk/jbrowse2/?session=local-_XVfgIvX5
Replies: 5 comments 9 replies
-
Hi there, I visited this link http://www.epigenome.cuhk.edu.hk/jbrowse2/?session=share-erGje5JdgP&password=2UElo and searched for an example gene ID. It downloads about 700kb of data, which is a fair amount, but the results otherwise show up quickly. I think there could be a situation where many of the genes share a similar prefix, and since the code uses the prefix to do the searching (but only up to a limited length), it ends up having to download a large amount of data for each search. Maybe we can look into making the prefix size larger; I think it is hardcoded right now.
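To illustrate the effect being described, here is a simplified sketch (not JBrowse's actual code) of how a trix-style `.ixx` side file maps fixed-size term prefixes to byte offsets in the sorted `.ix` file: when all gene IDs share a prefix longer than the index's prefix size, every search falls into one giant bucket and the whole file must be fetched.

```python
def build_ixx(ix_lines, prefix_size):
    """Map each distinct fixed-size prefix to the byte offset of its first
    line in the sorted .ix file (simplified model of a trix .ixx)."""
    ixx, offset = {}, 0
    for line in ix_lines:
        ixx.setdefault(line[:prefix_size], offset)
        offset += len(line) + 1  # +1 for the newline
    return ixx, offset  # offset is now the total file size

def bucket_bytes(ixx, total, term, prefix_size):
    """Byte range that must be downloaded to search for `term`."""
    prefixes = sorted(ixx)
    p = term[:prefix_size]
    if p not in ixx:
        return 0
    i = prefixes.index(p)
    end = ixx[prefixes[i + 1]] if i + 1 < len(prefixes) else total
    return end - ixx[p]

# Maize-style IDs all share "Zm000...", so a 5-char prefix index has a
# single bucket covering the entire file.
ids = sorted(f"Zm00001d{n:06d}" for n in range(1000))
ixx, total = build_ixx(ids, 5)
print(len(ixx), bucket_bytes(ixx, total, "Zm00001d000500", 5))  # 1 15000

# IDs with distinct leading characters partition into small buckets instead.
ids2 = sorted(f"{c}{n:04d}" for c in "ABCDE" for n in range(200))
ixx2, total2 = build_ixx(ids2, 1)
print(len(ixx2), bucket_bytes(ixx2, total2, "C0123", 1))  # 5 1200
```

In the first case the one bucket is the whole 15000-byte file; in the second, each search touches only a fifth of it. A larger prefix size would split the maize IDs into many small buckets.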
-
I encountered a similar issue, #2826. In my experience, gene IDs are extremely likely to share a common prefix identifying the assembly the gene models belong to. For reference, of my 20 plant genomes pulled from Phytozome, the majority have a common prefix longer than 5 characters across all gene models.

I think adjusting the prefix size is probably the best option. In addition to making it user configurable, I think we should try to dynamically determine the prefix size during index creation. This could be something as simple as "choose prefix size K such that the ixx file is as close to L lines as possible", where L is some configurable constant, maybe defaulting to 1000 lines or so; I'm not sure where the performance optimum is here.

Long term, the current trix concept may want updating. Even with dynamically set prefix sizes, there will still be degenerate cases where the '*.ixx' index fails to meaningfully partition the '*.ix' file into similarly sized buckets. The simplest way around this conceptually is probably to change the '*.ixx' file from a <prefix> -> <start, stop> mapping to a sorted <object> -> <location> mapping. A search string can then be compared against those objects to identify which pair of objects it lies between, and only that chunk of the '*.ix' file needs to be searched. The objects can be selected to partition the database into equally sized chunks regardless of their content.

Also, I still think a larger chunk size for reading the ix file would make sense, though I really have no idea where that is controlled.
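The dynamic-sizing heuristic described above ("choose K so the ixx is close to L lines") could be sketched like this; the function name and constants here are hypothetical, not anything in JBrowse:

```python
def pick_prefix_size(terms, target_lines=1000, max_k=30):
    """Pick the prefix size K whose number of distinct prefixes (i.e. the
    number of .ixx lines) lands closest to `target_lines` (the constant L
    from the suggestion above). Purely illustrative heuristic."""
    best_k, best_gap = 1, float("inf")
    for k in range(1, max_k + 1):
        n = len({t[:k] for t in terms})
        gap = abs(n - target_lines)
        if gap < best_gap:
            best_k, best_gap = k, gap
        if n >= target_lines:
            # Distinct-prefix count only grows with k, so once we pass the
            # target, longer prefixes can only move further away from it.
            break
    return best_k

# 50000 maize-style IDs sharing the long "Zm00001d0" prefix: short prefixes
# all collapse into one bucket, so the heuristic walks out to K=12, where
# the 500 distinct prefixes come closest to the 1000-line target.
ids = [f"Zm00001d{n:06d}" for n in range(50000)]
print(pick_prefix_size(ids, target_lines=1000))  # 12
```

This handles the shared-assembly-prefix case automatically, though as noted it still cannot fix pathological distributions, which is where the sorted <object> -> <location> partition idea would come in.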
-
@teresam856 could you weigh in on this?
-
There is now a --prefixSize argument to text-index.
-
Users can now pass a --prefixSize argument to text-index to tune the index's prefix size and improve search performance. In later versions, we may try to auto-calculate this to improve things further!
Beta Was this translation helpful? Give feedback.