How do I confirm anonymization of bam files #4

mropri92 · 2023-07-04T16:17:59Z

Hi,

Thanks you for this awesome tool. Just a question I had was, how would I confirm that after running BAMboozle on my bam file that it actually has been anonymized? Appreciate any help.

cziegenhain · 2023-07-04T16:59:51Z

Hi,

If you want to be super precise, I recommend to pileup all reads over all positions using samtools mpileup
http://www.htslib.org/doc/samtools-mpileup.html

mropri92 · 2023-07-04T17:05:22Z

Thank you for your help. Just to make sure I understand, once I run mpileup on my before BAMboozle bam file and mpileup on my after BAMboozle bam file, I can compare the base calls in the read based column? Sorry for my naivety, and am very thankful for your help.

cziegenhain · 2023-07-04T18:07:17Z

No problem! You don't necessarily need to run mpileup on the original bam file, since you are only really interested in to check that there is no more bases mismatching to reference after running BAMboozle!

mropri92 · 2023-07-04T18:08:43Z

Sounds good, thanks for your help and time!

mropri92 · 2023-07-04T23:13:53Z

I was able to use mpileup to look at one bam file and determine that it was anonymized. However, this was tedious. Is there a quicker, more streamlined way to determine if my file is anonymized? I only ask as I have to do this for multiple files. Sorry for the bother again. And as always, appreciate your help.

cziegenhain · 2023-07-05T04:18:01Z

Hi,
Not sure what you mean with tedious here. You could think about just starting multiple/all mpileup calls in parallel.

mropri92 · 2023-07-05T07:47:55Z

Hi,

Sorry for not being clear on the tedious part. I guess when you advised to check that there are no more bases mismatching to the reference after running BAMboozle, I took it to mean to check for every position in the bam file. Is there a specific position that we are just looking for to check that there are no mismatches?

cziegenhain · 2023-07-05T09:19:49Z

Exactly, the proper way of verifying would indeed be to look at all covered positions!

mropri92 · 2023-07-05T09:30:36Z

Gotcha, thank you again for helping me with this and taking the time to explain. Appreciate your help!

sropri · 2023-08-03T18:23:14Z

Jut a further question for clarification. If we are checking in our mpileup output that after running BAMboozle there should be no more mismatching bases to the reference genome, wouldn't SNPs be called as mismatching? And wouldn't this give you some number of mismatching bases? How does BAMboozle account for that? Appreciate your help.

cziegenhain · 2023-08-03T18:35:49Z

Hi,

Not sure if I understand your question correctly. Since the goal of BAMboozle is to remove any donor genetic variation, there should be no more SNPs present in the data after BAMboozling the reads.

sropri · 2023-08-03T18:39:13Z

Oh, so the SNPs are anonymized as well to remove donor genetic variation. Sorry for the confusion. I just had one quick question as well. After the mpileup output file, is there a command that you use to just give you the number of mismatches contained in the output file? Sorry for my question, am just new to samtools and appreciate your help.

sropri · 2023-08-03T18:50:12Z

Because from my understanding, I should be looking in the 5th column of the mpileup output file for any occurences of ACGTN or agctn as they represent mismatches. Thus, I was running this command:

file="output.txt"

Count occurrences of "ACGTN" or "acgtn" in the 5th column

total_occurrences=$(awk -F'\t' '{count += gsub(/[ACGTNacgtn]/, "");} END{print count}' "$file")

echo "Total occurrences of ACGTN or acgtn in the 5th column: $total_occurrences"

Appreciate your guidance in this

cziegenhain · 2023-08-03T18:56:29Z

The text format of mpileup is a bit difficult to parse I think, so for me the easiest way is to use the bcftools branch which has VCF format support like this:

bcftools mpileup --fasta-ref your_ref.fa --max-depth 10000 --threads 10 --output-type z --output bamboozle.mpileup.vcf.gz your_bamboozle_output.bam
bcftools view --min-alleles 3 --threads 10 -O z -o bamboozle.mpileup.filter.vcf.gz bamboozle.mpileup.vcf.gz

The filtered output VCF should then be completely empty (other than the header).

Hope that helps.

sropri · 2023-08-03T19:01:58Z

This great. Thanks for your help. Will try this out and sorry for the bother.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How do I confirm anonymization of bam files #4

How do I confirm anonymization of bam files #4

mropri92 commented Jul 4, 2023

cziegenhain commented Jul 4, 2023

mropri92 commented Jul 4, 2023

cziegenhain commented Jul 4, 2023

mropri92 commented Jul 4, 2023

mropri92 commented Jul 4, 2023

cziegenhain commented Jul 5, 2023

mropri92 commented Jul 5, 2023

cziegenhain commented Jul 5, 2023

mropri92 commented Jul 5, 2023

sropri commented Aug 3, 2023

cziegenhain commented Aug 3, 2023

sropri commented Aug 3, 2023

sropri commented Aug 3, 2023

cziegenhain commented Aug 3, 2023

sropri commented Aug 3, 2023

How do I confirm anonymization of bam files #4

How do I confirm anonymization of bam files #4

Comments

mropri92 commented Jul 4, 2023

cziegenhain commented Jul 4, 2023

mropri92 commented Jul 4, 2023

cziegenhain commented Jul 4, 2023

mropri92 commented Jul 4, 2023

mropri92 commented Jul 4, 2023

cziegenhain commented Jul 5, 2023

mropri92 commented Jul 5, 2023

cziegenhain commented Jul 5, 2023

mropri92 commented Jul 5, 2023

sropri commented Aug 3, 2023

cziegenhain commented Aug 3, 2023

sropri commented Aug 3, 2023

sropri commented Aug 3, 2023

Count occurrences of "ACGTN" or "acgtn" in the 5th column

cziegenhain commented Aug 3, 2023

sropri commented Aug 3, 2023