Unexpected behavior when downloading fastq using SRA identifier #34

jolespin · 2023-12-13T19:35:14Z

https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&page_size=10&acc=SRR13615821&display=metadata

I ran kingfisher and it pulled 3 FASTQ files for 1 record: one single-end file and two paired-end files.

(base) [jespinoz@exp-15-28 split_reads]$ kingfisher --version
0.3.1

ID=SRR13615821
kingfisher get -r ${ID} -m aws-http -f fastq.gz

I thought that maybe one was interleaved, but the read counts didn't match up:

(base) [jespinoz@exp-15-28 Fastq]$ seqkit stats SRR13615821_1.fastq.gz SRR13615821_2.fastq.gz split_reads/SRR13615821.fastq.gz
processed files:  3 / 3 [======================================] ETA: 0s. done
file                              format  type   num_seqs        sum_len  min_len  avg_len  max_len
SRR13615821_1.fastq.gz            FASTQ   DNA     808,228    197,172,014       35      244      301
SRR13615821_2.fastq.gz            FASTQ   DNA     808,228    199,461,172       21    246.8      301
split_reads/SRR13615821.fastq.gz  FASTQ   DNA   5,860,790  1,438,979,322       35    245.5      301

The files above are the ones kingfisher downloaded.

Note: I moved SRR13615821.fastq.gz into a separate folder to split the reads, but BBSuite said there were no pairs:

(base) [jespinoz@exp-15-28 split_reads]$ repair.sh in=SRR13615821.fastq.gz out1=SRR13615821_1.fastq.gz out2=SRR13615821_2.fastq.gz
java -ea -Xmx84979m -cp /expanse/projects/jcl110/miniconda3/opt/bbmap-39.01-1/current/ jgi.SplitPairsAndSingles rp in=SRR13615821.fastq.gz out1=SRR13615821_1.fastq.gz out2=SRR13615821_2.fastq.gz
Executing jgi.SplitPairsAndSingles [rp, in=SRR13615821.fastq.gz, out1=SRR13615821_1.fastq.gz, out2=SRR13615821_2.fastq.gz]

Set INTERLEAVED to false
Started output stream.

Input:                  	5860790 reads 		1438979322 bases.
Result:                 	5860790 reads (100.00%) 	1438979322 bases (100.00%)
Pairs:                  	0 reads (0.00%) 	0 bases (0.00%)
Singletons:             	5860790 reads (100.00%) 	1438979322 bases (100.00%)

Time:                         	36.897 seconds.
Reads Processed:       5860k 	158.84k reads/sec
Bases Processed:       1438m 	39.00m bases/sec

The above is me trying to split the reads manually.

Do you know what could be happening?

jolespin · 2023-12-13T19:43:20Z

I tried downloading using a separate command:

(base) [jespinoz@exp-15-28 tmp]$ kingfisher get -r SRR13615821 -m ena-ascp aws-http prefetch
12/13/2023 11:39:51 AM INFO: Kingfisher v0.3.1
12/13/2023 11:39:51 AM INFO: Attempting download method ena-ascp for run SRR13615821 ..
12/13/2023 11:39:51 AM INFO: Using aspera ssh key file: /expanse/projects/jcl110/miniconda3/lib/python3.11/site-packages/kingfisher/data/asperaweb_id_dsa.openssh
12/13/2023 11:39:51 AM INFO: Querying ENA for FTP paths for SRR13615821..
12/13/2023 11:39:52 AM INFO: Downloading 3 FTP read set(s): ftp.sra.ebi.ac.uk/vol1/fastq/SRR136/021/SRR13615821/SRR13615821.fastq.gz, ftp.sra.ebi.ac.uk/vol1/fastq/SRR136/021/SRR13615821/SRR13615821_1.fastq.gz, ftp.sra.ebi.ac.uk/vol1/fastq/SRR136/021/SRR13615821/SRR13615821_2.fastq.gz
12/13/2023 11:39:52 AM INFO: Running command: ascp -T -l 300m -P33001 -k 2 -i /expanse/projects/jcl110/miniconda3/lib/python3.11/site-packages/kingfisher/data/asperaweb_id_dsa.openssh [email protected]:/vol1/fastq/SRR136/021/SRR13615821/SRR13615821.fastq.gz .
12/13/2023 11:39:52 AM WARNING: Error downloading from ENA with ASCP: Command ascp -T -l 300m -P33001 -k 2 -i /expanse/projects/jcl110/miniconda3/lib/python3.11/site-packages/kingfisher/data/asperaweb_id_dsa.openssh [email protected]:/vol1/fastq/SRR136/021/SRR13615821/SRR13615821.fastq.gz . returned non-zero exit status 127.
STDERR was: b'bash: ascp: command not found\n'STDOUT was: b''
12/13/2023 11:39:52 AM WARNING: Method ena-ascp failed
12/13/2023 11:39:52 AM INFO: Attempting download method aws-http for run SRR13615821 ..
12/13/2023 11:39:53 AM INFO: Found ODP link https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR13615821/SRR13615821
12/13/2023 11:39:53 AM INFO: Downloading .SRA file from AWS Open Data Program HTTP link using aria2c ..

12/13 11:39:53 [NOTICE] Downloading 1 item(s)

12/13 11:39:54 [NOTICE] Allocating disk space. Use --file-allocation=none to disable it. See --file-allocation option in man page for more details.
[#2f4efe 831MiB/852MiB(97%) CN:1 DL:104MiB]
12/13 11:40:05 [NOTICE] Download complete: /expanse/projects/jcl110/VEBA_v2_CaseStudies/Kolyma_Permafrost/Fastq/tmp/SRR13615821.sra

Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
2f4efe|OK  |    89MiB/s|/expanse/projects/jcl110/VEBA_v2_CaseStudies/Kolyma_Permafrost/Fastq/tmp/SRR13615821.sra

Status Legend:
(OK):download completed.
12/13/2023 11:40:05 AM INFO: Download finished, validating ..
12/13/2023 11:40:05 AM INFO: Method aws-http worked.
12/13/2023 11:40:05 AM INFO: Extracting .sra file with fasterq-dump ..
12/13/2023 11:40:46 AM INFO: Output files: SRR13615821_1.fastq, SRR13615821_2.fastq, SRR13615821.fastq
12/13/2023 11:40:46 AM INFO: Kingfisher done.

wwood · 2023-12-13T21:05:31Z

Sometimes this happens when people upload reads that have been QC'd (so some pairs become single-ended reads), I think.

I don't think there is any issue with kingfisher - looks like it is just the NCBI webpage being misleading? EBI has 3 files too:
https://www.ebi.ac.uk/ena/browser/view/SRR13615821
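
If the QC explanation is right, the `_1`/`_2` files should hold the same number of records (the surviving pairs) and the unsuffixed file should hold the orphaned reads. A minimal sketch for checking that, assuming standard 4-line FASTQ records with no wrapped sequence lines; `count_fastq_records` and `mates_in_sync` are hypothetical helper names, not part of kingfisher:

```shell
# Count records in a gzipped FASTQ file.
# Assumption: every record is exactly 4 lines (no wrapped sequences).
count_fastq_records() {
    echo $(( $(gzip -cd "$1" | wc -l) / 4 ))
}

# Succeed (exit status 0) only if both mate files hold the same number of records.
# This checks counts only; it does not compare read names.
mates_in_sync() {
    local n1 n2
    n1=$(count_fastq_records "$1")
    n2=$(count_fastq_records "$2")
    [ "$n1" -eq "$n2" ]
}
```

For this run, something like `mates_in_sync SRR13615821_1.fastq.gz SRR13615821_2.fastq.gz && echo "mate counts match"` should agree with the 808,228 / 808,228 counts that seqkit reported.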

jolespin · 2023-12-14T18:08:11Z

From NCBI Help Desk:

Checking. I'd have to pull the originals to check, but my preliminary guess is that this arises because of asymmetry in the pairs: R2 might have less pairs (perhaps eliminated in aggressive QC?), accounting for a lopsided size difference between R1 and R2, where the "single ended file" is mostly R1.

I see this "three file" behavior with fasterq-dump, but not with a generic fastq-dump

fastq-dump --split-files --origfmt SRR13615821
Rejected 5860790 READS because READLEN < 1
Read 6669018 spots for SRR13615821
Written 6669018 spots for SRR13615821

wc -l SRR13615821*
26676072 SRR13615821_1.fastq
3232912 SRR13615821_2.fastq

grep "^@" SRR13615821_1.fastq | wc -l
6669018 #where 6,669,018 rounds up to the 6.7M reported by SRA pages.
grep "^@" SRR13615821_2.fastq | wc -l
808228

SRA Curator

Not sure if this is helpful or not for you. It's the first time I've experienced an issue like this.
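
For what it's worth, the numbers in this thread do seem to line up. A quick arithmetic sketch using only counts quoted above (the variable names are mine, and the interpretation is a guess rather than something NCBI confirmed):

```shell
# Counts copied from the outputs quoted in this thread.
paired=808228       # records in each of SRR13615821_1/_2.fastq.gz (seqkit stats)
singles=5860790     # records in the unsuffixed SRR13615821.fastq.gz (seqkit stats)
spots=6669018       # "Read 6669018 spots" reported by fastq-dump
lines_r1=26676072   # wc -l SRR13615821_1.fastq from fastq-dump --split-files
lines_r2=3232912    # wc -l SRR13615821_2.fastq from fastq-dump --split-files

total=$((paired + singles))     # pairs plus singletons
records_r1=$((lines_r1 / 4))    # 4 lines per FASTQ record
records_r2=$((lines_r2 / 4))

echo "paired + singles:  $total (spots: $spots)"
echo "fastq-dump R1:     $records_r1 records"
echo "fastq-dump R2:     $records_r2 records"
```

So 808,228 + 5,860,790 = 6,669,018, which is the spot count, and the 5,860,790 singletons equal the number of reads fastq-dump rejected for READLEN < 1. That looks consistent with most spots having an empty second read: fasterq-dump appears to put those spots in the unsuffixed file, while fastq-dump --split-files keeps every first read in _1 and writes only the non-empty second reads to _2.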
