[Idea] Experiment with indexing transcriptomics data on NeMO #73

rly · 2024-05-25T01:11:31Z

Transcriptomics data on the NeMO archive are often stored as ASCII text files (FASTQ, FASTA, MEX) that are sometimes tarballed and sometimes gzipped. I have also found tarballed BAM files (binary).

You can index the files in an uncompressed tarball by byte range using the tar member headers, which record each file's name and size. Supposedly you can also index gzipped files and decompress byte ranges of those as well.
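A minimal sketch of the tarball case, assuming an uncompressed `.tar` and using only Python's standard `tarfile` module. The helper names (`index_tar_members`, `read_member_bytes`, `make_demo_tar`) and the demo file contents are made up for illustration; this is not an existing NeMO or LINDI API.

```python
import io
import json
import tarfile


def index_tar_members(fileobj):
    """Map each regular file in an uncompressed tar to the byte range of its data."""
    index = {}
    # mode="r:" refuses compressed archives, where raw byte offsets would not apply
    with tarfile.open(fileobj=fileobj, mode="r:") as tar:
        for member in tar:
            if member.isfile():
                index[member.name] = {
                    "offset": member.offset_data,  # where the member's data starts
                    "size": member.size,           # length of the member's data in bytes
                }
    return index


def read_member_bytes(fileobj, entry, start=0, length=None):
    """Read part of one member straight from the archive bytes, without tarfile."""
    remaining = max(entry["size"] - start, 0)
    n = remaining if length is None else min(length, remaining)
    fileobj.seek(entry["offset"] + start)
    return fileobj.read(n)


def make_demo_tar():
    """Build a tiny in-memory tarball (hypothetical contents) to try the index on."""
    files = [
        ("reads.fastq", b"@r1\nACGT\n+\nIIII\n"),
        ("barcodes.tsv", b"AAAC\nAAAG\n"),
    ]
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    archive = make_demo_tar()
    index = index_tar_members(archive)
    print(json.dumps(index, indent=2))  # the index itself is plain JSON
    print(read_member_bytes(archive, index["barcodes.tsv"], start=5, length=4))
```

The index only needs to be built once per tarball. After that, any member (or a slice of it) can be read with a single seek, which is the property that would make remote range reads possible.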

Example BICCN data:
https://data.nemoarchive.org/biccn/grant/u01_lein/lein/transcriptome/sncell/10x_v3/
https://data.nemoarchive.org/biccn/grant/u01_lein/linnarsson/transcriptome/sncell/10x_v2/human/processed/CellRanger5/
https://data.nemoarchive.org/biccn/grant/u19_huang/arlotta/transcriptome/sncell/10x_v2/mouse/processed/align/

Some of these data files can be very large, and a user may want to access only particular elements of the data file without having to download the entire file. I wonder if we can use LINDI to create an efficient JSON index of specific data elements within a NeMO-hosted dataset for streaming and local access. Just an idea right now as we brainstorm for the grant proposal.
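To make the "access only particular elements" part concrete, here is a rough sketch of fetching one indexed byte range over HTTP with a `Range` header. It assumes the server honors range requests, which I have not verified for NeMO. The function names and the example URL are placeholders, and the `offset`/`size` values would come from whatever index we build.

```python
import urllib.request


def build_range_request(url, offset, size):
    """Build a GET request for `size` bytes starting at byte `offset`."""
    if offset < 0 or size <= 0:
        raise ValueError("offset must be >= 0 and size must be > 0")
    last = offset + size - 1  # HTTP byte ranges are inclusive on both ends
    return urllib.request.Request(url, headers={"Range": f"bytes={offset}-{last}"})


def fetch_range(url, offset, size, timeout=30):
    """Download only the requested bytes; fail if the server ignores the range."""
    request = build_range_request(url, offset, size)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        if response.status != 206:  # 206 Partial Content means the range was honored
            raise RuntimeError(f"expected 206 Partial Content, got {response.status}")
        return response.read()


if __name__ == "__main__":
    # Placeholder URL; no network call is made here.
    req = build_range_request("https://example.org/archive.tar", offset=512, size=16)
    print(req.get_header("Range"))
```

A server that does not support ranges typically replies with 200 and the full file, so checking for 206 avoids silently downloading everything.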

BDBags can be used to index and download particular files of a dataset, but I don't know if this works for files within a tarball or for records within a FASTQ file.

rly added the "category: proposal" label (proposed enhancements or new features) on May 25, 2024
