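I'm using STARsolo on a 10X v2 scRNA-seq dataset in which the barcode read containing the CB + UMI sequence is 150 base pairs long rather than 26. To run STARsolo, I set --soloBarcodeReadLength to 150. This was my invocation command:
STAR --runThreadN $SLURM_CPUS_PER_TASK --genomeDir /fdb/STAR_current/UCSC/hg38/genes-50/ --sjdbOverhang 50 --readFilesIn cdna.fastq.gz barcodes.full.fastq.gz --soloCBwhitelist $DATA/737K-august-2016.txt --soloType CB_UMI_Simple --readFilesCommand zcat --outFileNamePrefix 150bp_barcode_run --soloBarcodeReadLength 150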
STARsolo runs without any error messages in any of the log files. Below is the Summary.csv file from the 150bp_barcode_runSolo.out/Gene directory:
Number of Reads,250000
Reads With Valid Barcodes,0.969452
Sequencing Saturation,0.0840912
Q30 Bases in CB+UMI,0.598961
Q30 Bases in RNA read,0.880727
Reads Mapped to Genome: Unique+Multiple,0.842508
Reads Mapped to Genome: Unique,0.605768
Reads Mapped to Transcriptome: Unique+Multipe Genes,0.469796
Reads Mapped to Transcriptome: Unique Genes,0.429724
Estimated Number of Cells,4045
Reads in Cells Mapped to Unique Genes,79863
Fraction of Reads in Cells,0.743389
Mean Reads per Cell,19
Median Reads per Cell,15
UMIs in Cells,73178
Mean UMI per Cell,18
Median UMI per Cell,13
Mean Genes per Cell,16
Median Genes per Cell,13
Total Genes Detected,8958
The fraction of Q30 bases in the CB + UMI field is 0.598961. This surprised me, because the FastQC report on the barcodes.full.fastq.gz file showed the following:
As you can see, the first 26 bases have a lower-quartile Q score of at least 30, whereas the mean Q scores for the remaining bases are below 30. I therefore suspected that STARsolo was computing the Q30 fraction across all 150 bases of the barcode read instead of restricting the calculation to the first 26 bases.
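One way to test this suspicion independently of STARsolo is to compute the Q30 fraction directly from the barcode FASTQ, once over the first 26 bases and once over the whole read. The following is a minimal sketch, not STARsolo's own calculation: the function name `q30_fraction` is hypothetical, and it assumes Phred+33 quality encoding and strict four-line FASTQ records.

```python
import gzip


def q30_fraction(fastq_path, n_bases=None, phred_offset=33):
    """Return the fraction of bases with Phred quality >= 30.

    If n_bases is given, only the first n_bases of each read are counted.
    Assumes four-line FASTQ records and Phred+33 encoding (both assumptions).
    """
    total = 0
    q30 = 0
    # Transparently handle gzip-compressed and plain-text FASTQ files.
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        for line_no, line in enumerate(handle):
            if line_no % 4 != 3:  # the quality string is the 4th line of a record
                continue
            quals = line.rstrip("\n")
            if n_bases is not None:
                quals = quals[:n_bases]
            total += len(quals)
            q30 += sum(1 for ch in quals if ord(ch) - phred_offset >= 30)
    return q30 / total if total else 0.0


# Example usage (paths taken from this post):
# q30_fraction("barcodes.full.fastq.gz", n_bases=26)  # CB + UMI only
# q30_fraction("barcodes.full.fastq.gz")              # all 150 bases
```

If the suspicion is right, the 26-base value should be close to what STARsolo reports for the truncated reads, and the full-read value close to what it reports for the 150 bp reads.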
It was still possible that the CB + UMI barcode started somewhere in the middle of the read rather than at the first base, so I also checked where the CB + UMI sequence is located in the barcode reads. There seemed to be no issue, as you can see from a zcat of the barcode file (apologies for the formatting of the fastq reads):
zcat barcodes.full.fastq.gz | head -n 100000 | tail -n 20
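The barcodes all start at the first base and end before the start of the poly-T tract. Finally, I truncated the fastq reads to the first 26 positions and re-ran STARsolo:
STAR --runThreadN $SLURM_CPUS_PER_TASK --genomeDir /fdb/STAR_current/UCSC/hg38/genes-50/ --sjdbOverhang 50 --readFilesIn cdna.fastq.gz barcodes.fastq.gz --soloCBwhitelist $DATA/737K-august-2016.txt --soloType CB_UMI_Simple --readFilesCommand zcat --outFileNamePrefix 26bp_barcode_run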
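The post does not show how the reads were truncated. For anyone reproducing this, here is a minimal Python sketch of one way to do it. The function name `truncate_fastq` is hypothetical, and the sketch assumes gzip-compressed input with strict four-line FASTQ records.

```python
import gzip


def truncate_fastq(in_path, out_path, keep=26):
    """Write a copy of a gzipped FASTQ with each read cut to its first `keep` bases.

    Header and '+' lines are copied unchanged; the sequence and quality lines
    are trimmed to the same length so the records stay valid.
    """
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        for line_no, line in enumerate(src):
            if line_no % 4 in (1, 3):  # sequence (2nd) and quality (4th) lines
                line = line.rstrip("\n")[:keep] + "\n"
            dst.write(line)


# Example usage (file names taken from this post):
# truncate_fastq("barcodes.full.fastq.gz", "barcodes.fastq.gz", keep=26)
```

Read order is preserved, so the truncated file still pairs line-for-line with cdna.fastq.gz.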
This time, the Q30 value for the CB + UMI read made sense:
cat 26bp_barcode_runSolo.out/Gene/Summary.csv
Number of Reads,250000
Reads With Valid Barcodes,0.969452
Sequencing Saturation,0.0840912
Q30 Bases in CB+UMI,0.96489
Q30 Bases in RNA read,0.880727
Reads Mapped to Genome: Unique+Multiple,0.842508
Reads Mapped to Genome: Unique,0.605768
Reads Mapped to Transcriptome: Unique+Multipe Genes,0.469796
Reads Mapped to Transcriptome: Unique Genes,0.429724
Estimated Number of Cells,4045
Reads in Cells Mapped to Unique Genes,79863
Fraction of Reads in Cells,0.743389
Mean Reads per Cell,19
Median Reads per Cell,15
UMIs in Cells,73178
Mean UMI per Cell,18
Median UMI per Cell,13
Mean Genes per Cell,16
Median Genes per Cell,13
Total Genes Detected,8958
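To confirm that nothing else changed between the two runs, the two Summary.csv files can be compared metric by metric. This is a minimal sketch: the helper names `parse_summary` and `differing_metrics` are hypothetical, and it assumes each line has the form `name,value` as in the listings above.

```python
def parse_summary(text):
    """Parse 'name,value' lines from a Summary.csv into a dict of strings."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last comma so commas in a metric name would not break parsing.
        name, value = line.rsplit(",", 1)
        metrics[name] = value
    return metrics


def differing_metrics(summary_a, summary_b):
    """Return {metric: (value_a, value_b)} for every metric whose value differs."""
    a = parse_summary(summary_a)
    b = parse_summary(summary_b)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(a.keys() | b.keys())
        if a.get(name) != b.get(name)
    }


# Example usage (paths are illustrative):
# with open("150bp_barcode_runSolo.out/Gene/Summary.csv") as f150, \
#      open("26bp_barcode_runSolo.out/Gene/Summary.csv") as f26:
#     print(differing_metrics(f150.read(), f26.read()))
```

For the two listings in this post, the only differing line is Q30 Bases in CB+UMI (0.598961 versus 0.96489). Every other metric, including the valid-barcode fraction and the cell count, is identical.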
To me, this confirms that STARsolo is mistakenly including bases 27-150 of the barcode read when reporting the Q30 Bases in CB+UMI value in the CSV file.
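As a rough consistency check, the two reported values imply a Q30 fraction for the extra bases alone. The sketch below assumes that every barcode read is exactly 150 bases long, that the full-length value is a plain per-base average over all 150 positions, and that both runs counted the same 250000 reads. The function name `implied_tail_q30` is hypothetical.

```python
def implied_tail_q30(q30_all, q30_prefix, read_len=150, prefix_len=26):
    """Q30 fraction of the bases after the prefix, implied by the two averages.

    Solves q30_all * read_len = q30_prefix * prefix_len + q30_tail * tail_len
    for q30_tail, assuming all reads have the same length.
    """
    tail_len = read_len - prefix_len
    return (q30_all * read_len - q30_prefix * prefix_len) / tail_len


# Values taken from the two Summary.csv files above.
tail = implied_tail_q30(q30_all=0.598961, q30_prefix=0.96489)
print(f"Implied Q30 fraction for bases 27-150: {tail:.3f}")
```

Under these assumptions the implied value is roughly 0.52, which fits the lower quality that FastQC shows beyond base 26.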