paired reads have different names #18

Mahmoudbassuoni · 2023-03-20T09:06:17Z

Hi, I am trying to run an alignment with bwa mem on the two files "U0a_CGATGT_L001_R1_001.fastq.gz" and "U0a_CGATGT_L001_R2_001.fastq.gz", which I downloaded from the FTP site, against the reference "GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz". The command I am using is:
bwa mem -t 16 -R '@RG\tID:H814YADXX.5.CGATGT.1101\tSM:HG001\tPL:illumina' GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz U0a_CGATGT_L001_R1_001.fastq.gz U0a_CGATGT_L001_R2_001.fastq.gz | samtools view -b - >HG001.GRCh38_no_alt_analysis_set.bam

but I am getting an error with the sequence headers:

[mem_sam_pe] paired reads have different names: "HWACAGATTTTGT", "HWI-D00360:5:H814YADXX:1:1102:11719:83283"
[mem_sam_pe] paired reads have different names: "HWACTATTDDD", "HWI-D00360:5:H814YADXX:1:1102:11293:83492"
[mem_sam_pe] paired reads have different names: "@@faaa(+:A0&AA", "HWI-D00360:5:H814YADXX:1:1102:11730:83321"
[mem_sam_pe] paired reads have different names: "HWI-A@HWI-D00360:5:H814YX:1:1102:10399:83348", "HWI-00360:5:H814YADXX:1:1102:11699:83300"
[mem_sam_pe] paired reads have different names: "ACD00360TJJJJC@AGCCCTGCACCACCTAATAAGAACTGGAAAGTCEEDDDDDDDD", "HWI-D00360:5:H814YADXX:1:1102:11719:83361"
[mem_sam_pe] paired reads have different names: "HWCTAAAATC:BDDDDFDDDDDDCEDDDHJJEHIIIJJJHHH>HFFEEEEET:83ACDDDDTAAATTEDDDDDDEDDDDJJFHJJJJJJJJJJJJJJJJJJJJJJIJJJJJJ@T4BJJJTTATCTTG>FGGCAGGCTJJIJJJJJEDEECDDFAAGTAAADDDDDDDCTCTTCTTGTTTTCCCC>AGCC60:5:HC814YJDDDCCDDIGCCCTTC1IIIIHIEDDD@FFFCTTC1IIIIHIEDCCC;>CC60:5:H:0:CGADXX:1:1ATGTTTA:N:0:CGAC>CGAC>CG3AGGCTGAGGYADXX:JJJJJJJJIJJA0360GAIAGEEDEEEEC:GJIIJJJC:0:CGATGIFFFHHHHHJJJJJDEDDDDDGDEDDDDGTTTTTAT@HHJJJTGT", "HWI-D00360:5:H814YADXX:1:1102:11549:83491"
[mem_sam_pe] paired reads have different names: "HWCATCCTCCCAAGACTAADD@FFFC99:833C99:833CGCTTTGFHH@FFFFDDDCCCDCFB:>CA8>A??CC:A:ACTTACTCAAAAAACTATH814CAAATGCAGDDD:TTAAGTTCACAGCGA8DEDDDDDGJJJJJJDDDDDBDDDDDDDDDDDDDTGGACTTTJJHHHF60:5:HHH@FFFGTGGCAGGCTCCTGTAACGDDDDDDDDATGAACTCIACTAGDDDBBDDG9ATGGAATTTGACTTGADXX:1CACCTGCCAAACATACCCGTCTTTACC(G36CAGACCACCTGGACTTCCAGGEECDCDCDGAGGCCTGGCCATGTTATATGAAGTGIDXX:1CACCTGCCAAACATACCCGT", "HWI-D00360:5:H814YADXX:1:1102:11746:83407"
[mem_sam_pe] paired reads have different names: "HWACTATTDEFFFHHHHCCTTGTGTE:@DDDD49?IJJIGIG83407", "HWI-D00360:5:H814YADXX:1:1102:11545:83354"

I have tried sorting the two files with fastq-sort but I still get the same error. Can anyone help?
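One way to narrow this down before running bwa mem is to compare the read names of the two files directly. The following is a minimal sketch, not part of the original report: it assumes plain 4-line FASTQ records, and the function names `read_name`, `first_name_mismatch`, and `check_pair` are hypothetical. Garbled names like the ones in the log above would show up as an early mismatch, and a damaged gzip stream may instead raise an error while reading.

```python
import gzip
from itertools import zip_longest


def read_name(header):
    """Return the read name from a FASTQ header line, or None if the line is missing."""
    if header is None:
        return None
    text = header.strip()
    if text.startswith("@"):
        text = text[1:]
    fields = text.split()
    name = fields[0] if fields else ""
    # Older Illumina headers mark the mate with a trailing /1 or /2.
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def first_name_mismatch(r1_lines, r2_lines):
    """Return (record_number, r1_name, r2_name) for the first disagreeing pair, else None.

    Assumes 4-line FASTQ records (no wrapped sequence lines).
    """
    for line_no, (h1, h2) in enumerate(zip_longest(r1_lines, r2_lines)):
        if line_no % 4 != 0 and h1 is not None and h2 is not None:
            continue  # sequence, '+', or quality line
        n1, n2 = read_name(h1), read_name(h2)
        if n1 != n2:
            return line_no // 4 + 1, n1, n2
    return None


def check_pair(r1_path, r2_path):
    """Open two gzipped FASTQ files and report the first name mismatch, if any."""
    with gzip.open(r1_path, "rt", errors="replace") as r1, \
         gzip.open(r2_path, "rt", errors="replace") as r2:
        return first_name_mismatch(r1, r2)
```

Calling `check_pair("U0a_CGATGT_L001_R1_001.fastq.gz", "U0a_CGATGT_L001_R2_001.fastq.gz")` returns `None` when every pair of names agrees. If one file ends before the other, the missing side is reported as `None`.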

chunlinxiao · 2023-03-20T13:32:08Z

You need to use the sequence.index file (https://github.com/genome-in-a-bottle/giab_data_indexes/blob/master/NA12878/sequence.index.NA12878_Illumina300X_wgs_09252015 in your case) to match R1 and R2 files.

For 300X ILMN raw reads, some R1/R2 files may have same names, but located in different directories, e.g.,

ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R1_001.fastq.gz cabfe5b609fb1fe11619fdc72060185c ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R2_001.fastq.gz 6f0faed9249c1a850e6ce57c61e26e04 HG001

ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_006_AH81VLADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R1_001.fastq.gz cc35b61053fe7505715f93175bbb16c4 ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_006_AH81VLADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R2_001.fastq.gz cd12a23c3d71061e1bc673ce8c598dba HG001
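As a rough illustration of how such an index record can be used to pair files, here is a small sketch. It assumes each record has the five whitespace-separated fields visible in the two examples above (R1 URL, R1 md5, R2 URL, R2 md5, sample name); the real index file may include a header row or other columns, and the names `PairRecord`, `parse_index_line`, and `same_directory` are made up for this example.

```python
from typing import NamedTuple


class PairRecord(NamedTuple):
    r1_url: str
    r1_md5: str
    r2_url: str
    r2_md5: str
    sample: str


def parse_index_line(line):
    """Split one index record into its five fields (layout assumed from the examples)."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, found {len(fields)}")
    return PairRecord(*fields)


def same_directory(record):
    """True when the R1 and R2 URLs share the same parent directory."""
    return record.r1_url.rsplit("/", 1)[0] == record.r2_url.rsplit("/", 1)[0]
```

Keying downloads on the full URL rather than the bare file name avoids mixing up the two `U0a_CGATGT_L001_R1_001.fastq.gz` files, which live in different flowcell directories.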

Hope this helps.

Mahmoudbassuoni · 2023-03-20T13:44:13Z

Yes, I used the forward and reverse reads for the same run from the same folder, which should be on the same line in the index you linked. That is, I took both FTP links from a single line, so they should belong to the same run.

chunlinxiao · 2023-03-20T14:05:40Z

In your example, can you post the full paths of the two files you used for mapping? Have you checked the md5?
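For reference, the md5 check could be sketched roughly as follows. This is only an illustration using Python's standard `hashlib`; the helper names `md5_of_chunks`, `md5_of_file`, and `matches_expected` are hypothetical, and the expected value would be the md5 listed next to the file in the index.

```python
import hashlib


def md5_of_chunks(chunks):
    """Return the hex md5 of an iterable of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path, chunk_size=1 << 20):
    """Stream a file from disk in chunks so large FASTQ files need not fit in memory."""
    def chunks():
        with open(path, "rb") as handle:
            while True:
                block = handle.read(chunk_size)
                if not block:
                    break
                yield block
    return md5_of_chunks(chunks())


def matches_expected(actual, expected):
    """Compare two hex digests, ignoring case and surrounding whitespace."""
    return actual.strip().lower() == expected.strip().lower()
```

For example, `matches_expected(md5_of_file("U0a_CGATGT_L001_R1_001.fastq.gz"), "cabfe5b609fb1fe11619fdc72060185c")` compares the downloaded R1 file with the value from the index line quoted earlier.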

Mahmoudbassuoni · 2023-03-21T10:41:10Z

Hi,
That was the forward (R1) file:
"ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R1_001.fastq.gz"
and that was the reverse (R2) one:
"ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R2_001.fastq.gz"

Mahmoudbassuoni · 2023-03-21T10:53:40Z

I have checked the md5 now and it looks like something went wrong with the download. I am downloading the files again and will check and get back to you. Thanks.

Mahmoudbassuoni · 2023-03-21T11:25:35Z

Hi @chunlinxiao,
I have downloaded the files again, but the md5sum output still does not match the one on the FTP site. I am not sure what could be wrong. I tried the same thing with another pair of files and the same thing happens.

Mahmoudbassuoni · 2023-03-21T13:22:25Z

I have tried the alignment using the two paired read files from the folder "giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/" and it went fine, but I am not sure about the data quality: those files are from 2014, whereas the files above are from 2020 and so should be more reliable.

chunlinxiao · 2023-03-21T20:47:51Z

Thanks for the update, and glad your alignment works now. I also tested your pairs on our side and nothing was wrong, so the paired data is fine.

Regarding the md5: we recently performed a metadata collection/analysis across all FASTQs that involved gunzip/gzip. This can produce different md5s, because the gz file header differs unless gzip -n is used. However, the uncompressed FASTQ files are unchanged and have identical md5s. The sequence.index files may need to be updated accordingly.
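The effect can be reproduced in a few lines. The sketch below uses Python's standard `gzip` module rather than the gzip command line, and a toy record instead of real GIAB data: the same payload is compressed twice with different header timestamps, which changes the md5 of the compressed bytes but not of the decompressed content. The helper names are illustrative only.

```python
import gzip
import hashlib
import io


def gzip_bytes(payload, mtime):
    """Compress payload in memory, stamping the given mtime into the gzip header."""
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb", mtime=mtime) as gz:
        gz.write(payload)
    return buffer.getvalue()


def md5_hex(data):
    """Hex md5 of a bytes object."""
    return hashlib.md5(data).hexdigest()


fastq = b"@read1\nACGT\n+\nIIII\n"  # toy record, not real GIAB data
first = gzip_bytes(fastq, mtime=0)
second = gzip_bytes(fastq, mtime=1)

# The compressed files differ only in the header timestamp...
compressed_md5s_differ = md5_hex(first) != md5_hex(second)
# ...while the decompressed content is byte-for-byte the same.
contents_identical = md5_hex(gzip.decompress(first)) == md5_hex(gzip.decompress(second))
print(compressed_md5s_differ, contents_identical)
```

So when a .gz md5 does not match the index, hashing the decompressed stream (for example the output of `gzip -dc`) is a more robust comparison, provided a reference md5 for the uncompressed file is available.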

Mahmoudbassuoni · 2023-03-22T10:36:04Z

So what do you think about relying on the old FASTQs from 2014? I am running a benchmarking process, so is it fine to use those FASTQs together with the VCFs from the NIST V4 directory?

jzook · 2023-03-22T14:23:08Z

Hi @Mahmoudbassuoni - all of the files in those directories were generated ~2014. They are probably OK to use for some purposes, but if you want to understand how your methods work on more recent Illumina data, you may want to use data from this publication: https://doi.org/10.1101/2020.12.11.422022.

chunlinxiao · 2023-04-06T19:35:41Z

Hi @Mahmoudbassuoni, the md5s were updated in sequence.index.NA12878_Illumina300X_wgs_09252015_updated (you can follow the link from the table).
