Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How do I confirm anonymization of bam files #4

Open
mropri92 opened this issue Jul 4, 2023 · 15 comments
Open

How do I confirm anonymization of bam files #4

mropri92 opened this issue Jul 4, 2023 · 15 comments

Comments

@mropri92
Copy link

mropri92 commented Jul 4, 2023

Hi,

Thanks you for this awesome tool. Just a question I had was, how would I confirm that after running BAMboozle on my bam file that it actually has been anonymized? Appreciate any help.

@cziegenhain
Copy link
Collaborator

Hi,

If you want to be super precise, I recommend to pileup all reads over all positions using samtools mpileup
http://www.htslib.org/doc/samtools-mpileup.html

@mropri92
Copy link
Author

mropri92 commented Jul 4, 2023

Thank you for your help. Just to make sure I understand, once I run mpileup on my before BAMboozle bam file and mpileup on my after BAMboozle bam file, I can compare the base calls in the read based column? Sorry for my naivety, and am very thankful for your help.

@cziegenhain
Copy link
Collaborator

No problem! You don't necessarily need to run mpileup on the original bam file, since you are only really interested in to check that there is no more bases mismatching to reference after running BAMboozle!

@mropri92
Copy link
Author

mropri92 commented Jul 4, 2023

Sounds good, thanks for your help and time!

@mropri92
Copy link
Author

mropri92 commented Jul 4, 2023

I was able to use mpileup to look at one bam file and determine that it was anonymized. However, this was tedious. Is there a quicker, more streamlined way to determine if my file is anonymized? I only ask as I have to do this for multiple files. Sorry for the bother again. And as always, appreciate your help.

@cziegenhain
Copy link
Collaborator

Hi,
Not sure what you mean with tedious here. You could think about just starting multiple/all mpileup calls in parallel.

@mropri92
Copy link
Author

mropri92 commented Jul 5, 2023

Hi,

Sorry for not being clear on the tedious part. I guess when you advised to check that there are no more bases mismatching to the reference after running BAMboozle, I took it to mean to check for every position in the bam file. Is there a specific position that we are just looking for to check that there are no mismatches?

@cziegenhain
Copy link
Collaborator

Exactly, the proper way of verifying would indeed be to look at all covered positions!

@mropri92
Copy link
Author

mropri92 commented Jul 5, 2023

Gotcha, thank you again for helping me with this and taking the time to explain. Appreciate your help!

@sropri
Copy link

sropri commented Aug 3, 2023

Jut a further question for clarification. If we are checking in our mpileup output that after running BAMboozle there should be no more mismatching bases to the reference genome, wouldn't SNPs be called as mismatching? And wouldn't this give you some number of mismatching bases? How does BAMboozle account for that? Appreciate your help.

@cziegenhain
Copy link
Collaborator

Hi,

Not sure if I understand your question correctly. Since the goal of BAMboozle is to remove any donor genetic variation, there should be no more SNPs present in the data after BAMboozling the reads.

@sropri
Copy link

sropri commented Aug 3, 2023

Oh, so the SNPs are anonymized as well to remove donor genetic variation. Sorry for the confusion. I just had one quick question as well. After the mpileup output file, is there a command that you use to just give you the number of mismatches contained in the output file? Sorry for my question, am just new to samtools and appreciate your help.

@sropri
Copy link

sropri commented Aug 3, 2023

Because from my understanding, I should be looking in the 5th column of the mpileup output file for any occurences of ACGTN or agctn as they represent mismatches. Thus, I was running this command:

file="output.txt"

Count occurrences of "ACGTN" or "acgtn" in the 5th column

total_occurrences=$(awk -F'\t' '{count += gsub(/[ACGTNacgtn]/, "");} END{print count}' "$file")

echo "Total occurrences of ACGTN or acgtn in the 5th column: $total_occurrences"

Appreciate your guidance in this

@cziegenhain
Copy link
Collaborator

The text format of mpileup is a bit difficult to parse I think, so for me the easiest way is to use the bcftools branch which has VCF format support like this:

  1. bcftools mpileup --fasta-ref your_ref.fa --max-depth 10000 --threads 10 --output-type z --output bamboozle.mpileup.vcf.gz your_bamboozle_output.bam
  2. bcftools view --min-alleles 3 --threads 10 -O z -o bamboozle.mpileup.filter.vcf.gz bamboozle.mpileup.vcf.gz

The filtered output VCF should then be completely empty (other than the header).

Hope that helps.

@sropri
Copy link

sropri commented Aug 3, 2023

This great. Thanks for your help. Will try this out and sorry for the bother.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants