Add plink converter function #515
base: master
Conversation
tristanpwdennis commented Mar 26, 2024
- Add function for converting data to PLINK format
- fix variant_allele error in biallelic_snp_calls
Check out this pull request on ReviewNB: see visual diffs & provide feedback on Jupyter Notebooks.
Hey Tristan. Nice work! I'll save comments for now but FYI - when you add notebooks to malariagen_data, make sure you have cleared all outputs, otherwise they can become quite hefty in size and then the repo balloons in size over time (all of it is stored in git history).
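For anyone following along, outputs can be cleared from the Jupyter UI or programmatically; here is a minimal sketch using nbformat (the notebook path below is just a placeholder):

import nbformat

# Placeholder path to the notebook being added in this PR.
path = "notebooks/plink_converter_example.ipynb"
nb = nbformat.read(path, as_version=4)
for cell in nb.cells:
    if cell.cell_type == "code":
        # Drop outputs and execution counts so the committed file stays small.
        cell.outputs = []
        cell.execution_count = None
nbformat.write(nb, path)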
…s, avoids using dask
…variant_allele mapping
I've found the source of the AssertionError (also see issue #516) - something to do with the allele mapping. I haven't managed to get to the bottom of it yet, but in this PR there's a temporary fix that just applies apply_allele_mapping without the numba decorator.
Removed numba decorator for the apply_allele_mapping function, for now
I've (I hope!) made a fix to the above error (issue #516), and described it in more detail there.
I think it should work. Feel free to mark it as "Ready for review" when you think that it is appropriate.
Hi @tristanpwdennis. Before we merge this PR, could I ask you to do some clean-up? test-1.ipynb needs to be removed, and you made some changes to .gitignore and test_snp_data.py. There are also quite a few print commands that need to be removed or switched to debug mode. Could you also add a test that checks (at least) that the file is created?
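As a rough illustration of such a test, here is a minimal sketch; the region, sample set and parameter values are placeholders, and the call just follows the converter function added in this PR:

import os

def test_plink_files_created(fixture, api, tmp_path):
    # Run the converter on a small slice of data and capture the output prefix.
    file_path = api.biallelic_snps_to_plink(
        results_dir=str(tmp_path),
        region="3L",
        sample_sets=["AG1000G-AO"],
        n_snps=10,
    )
    # The three PLINK files should all have been written.
    for ext in (".bed", ".bim", ".fam"):
        assert os.path.exists(f"{file_path}{ext}")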
malariagen_data/anoph/to_plink.py
Outdated
# Filter SNPs for segregating sites only
with self._spinner("Subsetting to segregating sites"):
    gt = ds_snps["call_genotype"].data.compute()
print("count alleles")
Can you remove this print statement?
malariagen_data/anoph/to_plink.py
Outdated
print("count alleles")
with ProgressBar():
    ac = allel.GenotypeArray(gt).count_alleles(max_allele=3)
print("ascertain segregating sites")
Same thing here.
malariagen_data/anoph/to_plink.py
Outdated
    & (ac[:, 0] <= max_ref_ac)
    & (an_missing <= max_missing_an)
)
print(f"ascertained {np.count_nonzero(loc_sites):,} sites")
Same thing here.
malariagen_data/anoph/to_plink.py
Outdated
print(f"ascertained {np.count_nonzero(loc_sites):,} sites")

# Set up dataset with required vars for plink conversion
print("Set up dataset")
Same thing here.
tests/anoph/test_snp_data.py
Outdated
@@ -88,6 +88,7 @@ def test_open_snp_sites(fixture, api: AnophelesSnpData):
     assert "variants" in contig_grp
     variants = contig_grp["variants"]
     assert "POS" in variants
+    assert False
Not sure what this is here for.
Thank you very much @tristanpwdennis. This is great. I added a few comments where there are still some print statements.
Hi @jonbrenas, I had a tidy and removed some redundant code from to_plink.py. I also added a test (test_plink_converter.py) to make sure the files are created. Let me know how everything looks & if this is sufficient, or if I can add any more tests. Hope this works ok!
Great job @tristanpwdennis! I may be misunderstanding the way the PLINK format works, but shouldn't it work with any biallelic site and not only with those where the alleles are ref and 1st alt? Is there a reason why you select only those? Also, bed_reader needs to be added to the list of packages installed before it is imported. This can be done by modifying the project's dependency configuration.
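For context, a minimal sketch of how a PLINK fileset is written with bed_reader (the values, sample names and output path below are invented purely for illustration):

import numpy as np
from bed_reader import to_bed

# 2 samples x 3 variants of allele counts (0, 1 or 2 per genotype).
val = np.array([[0, 1, 2], [2, 2, 0]], dtype="float32")
properties = {
    "iid": np.array(["sample_1", "sample_2"]),
    "chromosome": np.array(["X", "X", "X"]),
    "bp_position": np.array([10_000_100, 10_000_200, 10_000_300]),
    "allele_1": np.array(["A", "C", "G"]),
    "allele_2": np.array(["T", "G", "A"]),
}
# Writes tiny.bed plus the accompanying tiny.bim and tiny.fam.
to_bed("tiny.bed", val, properties=properties, count_A1=True)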
malariagen_data/anoph/to_plink.py
Outdated
        "sample_id",
        "call_genotype",
    ]
].isel(alleles=slice(0, 2))  # .sel(variants=ix_sites_thinned)
The comment can probably be deleted.
tests/anoph/test_plink_converter.py
Outdated
if os.path.exists(f"{file_path}.bim"):
    pass
if os.path.exists(f"{file_path}.fam"):
    pass
These should probably be assert os.path.exists(...). I don't think these tests could fail even if the function didn't create the correct files.
It might also be a good idea to have a check that the data in the files is correct for "dummy" data. For example, looking at only the first 5 samples of 'AG1000G-AO' and the region 'X:10_000_000-10_000_500', I find only 3 biallelic sites. With such small data, it should be fairly simple to figure out what the PLINK files should be and check that they are generated correctly. You could, obviously, use any other "dummy" data that you like.
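As a rough sketch of what such a content check could look like (file_path and ds_test are assumed to come from the test setup discussed above):

import os
import numpy as np
from bed_reader import open_bed

# All three PLINK files should exist.
for ext in (".bed", ".bim", ".fam"):
    assert os.path.exists(f"{file_path}{ext}")

with open_bed(f"{file_path}.bed") as bed:
    # Dimensions should match the pre-export dataset (assuming no sites were
    # dropped by the converter's filters).
    assert bed.iid.shape[0] == ds_test.sizes["samples"]
    assert bed.bp_position.shape[0] == ds_test.sizes["variants"]
    # Variant positions should round-trip via the .bim file.
    np.testing.assert_array_equal(bed.bp_position, ds_test["variant_position"].values)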
… check that dimensions and content of exported dummy data are correct
Hi Jon. Good spot! Selecting the REF and 1st ALT is a relic from before I used .biallelic_snp_calls and had to manually modify the alleles dimension. It didn't change anything by being there (e.g. it was selecting the first two alleles of a dataset that already only had two alleles), but it's redundant now and I've removed it. I've updated the tests to include the feedback above, making sure that the exported dummy data match the data pre-export, both in terms of dimensions and actual content. Please take a look and let me know if there's anything else you think I can add. Apologies also for the long time taken and revisions needed - you can probably tell that I am quite new to this. Thanks for your feedback and help! -t
Thank you very much, @tristanpwdennis. It passes the eye-test, so I merged it into a shadow PR to see whether it passes the various tests. It hasn't, but I haven't checked yet whether there is something big going wrong or if it's the result of small conflicts with recent code. I'll come back to you if there is something significant in the code that needs to be addressed.
Hi @tristanpwdennis. I think you forgot to add base_params.py.
Hi Jon, apologies, I think I'm missing something - I can see base_params.py in the anoph directory, both on GH and in my local?
Hi @tristanpwdennis. Sorry, I had misinterpreted the error.
malariagen_data/anoph/snp_data.py
Outdated
variant_allele_dask = ds_bi["variant_allele"].data
variant_allele_out = dask_apply_allele_mapping(
    variant_allele_dask, allele_mapping, max_allele=1
)

variant_allele = ds_bi["variant_allele"].data
variant_allele = variant_allele.rechunk((variant_allele.chunks[0], -1))

# Chunk allele mapping according to same variant_allele.
allele_mapping_chunked = da.from_array(
    allele_mapping, chunks=variant_allele.chunks
)

# Apply allele mapping blockwise to variant_allele.
variant_allele_out = da.map_blocks(
    lambda allele, map: apply_allele_mapping(allele, map, max_allele=1),
    variant_allele,
    allele_mapping_chunked,
    dtype=variant_allele.dtype,
    chunks=(variant_allele.chunks[0], [2]),
)
Hi @tristanpwdennis, these changes are no longer necessary as the bug has been fixed in master. The helper function dask_apply_allele_mapping() handles the necessary transformation.
My git-fu is not perfect, but there should be a simple way to revert these changes so snp_data.py matches origin/master. Perhaps something like git checkout origin/master -- malariagen_data/anoph/snp_data.py might work?
Hi @alimanfoo - sorry - I'm sure you and @jonbrenas can tell I've been struggling a bit here (lack of git-fu, though that website has been helpful). This didn't seem to work as far as I can tell - so any suggestions welcome, thanks! I'll keep looking...
Hi @tristanpwdennis, @alimanfoo, I am no expert at git-fu either but I think the easiest solution is to make the change in the shadow PR (where all tests were successful except for the new notebook that uses a local path). As a matter of fact, I did it without any real issue. I would say that, at this point, you have accomplished the job @tristanpwdennis (congrats and thank you, by the way) and all that is left is some light clean-up. @alimanfoo, do you object to considering this PR as a success and closing it (with the option to do more fine-tuning if needed in the shadow PR)?
Thanks so much @jonbrenas and @tristanpwdennis 🙏🏻. Having a quick look over this PR there's a couple of small things I noticed, probably worth iterating a bit more here before calling it done. I'll add some comments tomorrow.
63a64cc to 6335208
Looks awesome! A few minor suggestions...
if os.path.exists(bed_file_path):
    return plink_file_path
Slightly concerned here that if a user changes their mind about something, like what sample sets to use, then reruns this function, the new file will not be written.
Possibly consider adding an overwrite parameter which is True by default, but could be set to False to avoid recomputation.
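A possible shape for that guard, sketched against the existing check (the overwrite parameter is only a suggestion here, not something in the current code):

# Assuming `overwrite: bool = True` has been added to the function signature,
# existing output is only reused when the caller opts out of rewriting.
if os.path.exists(bed_file_path) and not overwrite:
    return plink_file_path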
# Set up dataset with required vars for plink conversion
ds_snps_asc = ds_snps[
    [
        "variant_contig",
        "variant_position",
        "variant_allele",
        "sample_id",
        "call_genotype",
    ]
]
Is this necessary? Could just access ds_snps directly.
with self._spinner("Computing genotype ref counts"):
    gt_asc = ds_snps_asc["call_genotype"].data.compute()
gn_ref = allel.GenotypeDaskArray(gt_asc).to_n_ref(fill=-127)
with ProgressBar():
    gn_ref = gn_ref.compute()
Suggested change:
- with self._spinner("Computing genotype ref counts"):
-     gt_asc = ds_snps_asc["call_genotype"].data.compute()
- gn_ref = allel.GenotypeDaskArray(gt_asc).to_n_ref(fill=-127)
- with ProgressBar():
-     gn_ref = gn_ref.compute()
+ with self._dask_progress("Computing genotype ref counts"):
+     gt_asc = ds_snps_asc["call_genotype"].data  # dask array
+     gn_ref = allel.GenotypeDaskArray(gt_asc).to_n_ref(fill=-127)
+     gn_ref = gn_ref.compute()
# Load final data
with ProgressBar():
    ds_snps_final = ds_snps_asc[
        ["variant_contig", "variant_position", "variant_allele", "sample_id"]
    ].isel(variants=loc_var)
Suggested change:
- # Load final data
- with ProgressBar():
-     ds_snps_final = ds_snps_asc[
-         ["variant_contig", "variant_position", "variant_allele", "sample_id"]
-     ].isel(variants=loc_var)
+ # Load final data
+ ds_snps_final = dask_compress_dataset(ds_snps_asc, loc_var, dim="variants")
The function dask_compress_dataset() can be imported from the util module, and provides an optimised implementation of selecting from a dataset using a boolean indexer.
alleles = ds_snps_final["variant_allele"].values
properties = {
    "iid": ds_snps_final["sample_id"].values,
    "chromosome": ds_snps_final["variant_contig"].values,
    "bp_position": ds_snps_final["variant_position"].values,
    "allele_1": alleles[:, 0],
    "allele_2": alleles[:, 1],
}
Could consider using a spinner...
Suggested change:
- alleles = ds_snps_final["variant_allele"].values
- properties = {
-     "iid": ds_snps_final["sample_id"].values,
-     "chromosome": ds_snps_final["variant_contig"].values,
-     "bp_position": ds_snps_final["variant_position"].values,
-     "allele_1": alleles[:, 0],
-     "allele_2": alleles[:, 1],
- }
+ with self._spinner("Prepare output data"):
+     alleles = ds_snps_final["variant_allele"].values
+     properties = {
+         "iid": ds_snps_final["sample_id"].values,
+         "chromosome": ds_snps_final["variant_contig"].values,
+         "bp_position": ds_snps_final["variant_position"].values,
+         "allele_1": alleles[:, 0],
+         "allele_2": alleles[:, 1],
+     }
    count_A1=True,
)

print(f"PLINK files written to to: {plink_file_path}")
Remove print statement.

def biallelic_snps_to_plink(
    self,
    results_dir,
Nit, consider output_dir?
min_minor_ac: Optional[base_params.min_minor_ac] = 0,
max_missing_an: Optional[base_params.max_missing_an] = 0,
Consider using same defaults as PCA and NJT here?
# Test to see if sample_id is exported correctly (stored in the .fam file).
assert set(bed.iid) == set(ds_test.sample_id.values)
Suggested change:
- # Test to see if sample_id is exported correctly (stored in the .fam file).
- assert set(bed.iid) == set(ds_test.sample_id.values)
+ # Test to see if sample_id is exported correctly (stored in the .fam file).
+ assert_array_equal(bed.iid, ds_test.sample_id.values)
ds_test = api.biallelic_snp_calls(
    **data_params,
    n_snps=n_snps,
)
It is possible that the dataset could be slightly different from what is written out by the plink converter, because the plink converter also checks for and removes any rows with all identical genotype calls. But we may not encounter that with the test dataset, so perhaps OK to ignore.
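Purely as a hypothetical illustration, the test could mirror that behaviour by dropping such sites from the comparison dataset before checking against the PLINK output (ds_test is reused from the test above; nothing below is in the current code):

import numpy as np

# A site is dropped if every sample carries an identical genotype call.
gt = ds_test["call_genotype"].values  # shape: (variants, samples, ploidy)
identical = (gt == gt[:, :1, :]).all(axis=(1, 2))
ds_test_cmp = ds_test.isel(variants=np.flatnonzero(~identical))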