-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How do I confirm anonymization of bam files #4
Comments
Hi, If you want to be super precise, I recommend to pileup all reads over all positions using |
Thank you for your help. Just to make sure I understand, once I run mpileup on my before BAMboozle bam file and mpileup on my after BAMboozle bam file, I can compare the base calls in the read based column? Sorry for my naivety, and am very thankful for your help. |
No problem! You don't necessarily need to run mpileup on the original bam file, since you are only really interested in to check that there is no more bases mismatching to reference after running BAMboozle! |
Sounds good, thanks for your help and time! |
I was able to use mpileup to look at one bam file and determine that it was anonymized. However, this was tedious. Is there a quicker, more streamlined way to determine if my file is anonymized? I only ask as I have to do this for multiple files. Sorry for the bother again. And as always, appreciate your help. |
Hi, |
Hi, Sorry for not being clear on the tedious part. I guess when you advised to check that there are no more bases mismatching to the reference after running BAMboozle, I took it to mean to check for every position in the bam file. Is there a specific position that we are just looking for to check that there are no mismatches? |
Exactly, the proper way of verifying would indeed be to look at all covered positions! |
Gotcha, thank you again for helping me with this and taking the time to explain. Appreciate your help! |
Jut a further question for clarification. If we are checking in our mpileup output that after running BAMboozle there should be no more mismatching bases to the reference genome, wouldn't SNPs be called as mismatching? And wouldn't this give you some number of mismatching bases? How does BAMboozle account for that? Appreciate your help. |
Hi, Not sure if I understand your question correctly. Since the goal of BAMboozle is to remove any donor genetic variation, there should be no more SNPs present in the data after BAMboozling the reads. |
Oh, so the SNPs are anonymized as well to remove donor genetic variation. Sorry for the confusion. I just had one quick question as well. After the mpileup output file, is there a command that you use to just give you the number of mismatches contained in the output file? Sorry for my question, am just new to samtools and appreciate your help. |
Because from my understanding, I should be looking in the 5th column of the mpileup output file for any occurences of ACGTN or agctn as they represent mismatches. Thus, I was running this command: file="output.txt" Count occurrences of "ACGTN" or "acgtn" in the 5th columntotal_occurrences=$(awk -F'\t' '{count += gsub(/[ACGTNacgtn]/, "");} END{print count}' "$file") echo "Total occurrences of ACGTN or acgtn in the 5th column: $total_occurrences" Appreciate your guidance in this |
The text format of mpileup is a bit difficult to parse I think, so for me the easiest way is to use the bcftools branch which has VCF format support like this:
The filtered output VCF should then be completely empty (other than the header). Hope that helps. |
This great. Thanks for your help. Will try this out and sorry for the bother. |
Hi,
Thanks you for this awesome tool. Just a question I had was, how would I confirm that after running BAMboozle on my bam file that it actually has been anonymized? Appreciate any help.
The text was updated successfully, but these errors were encountered: