
Add plink converter function #515

Open · wants to merge 18 commits into master

Conversation

@tristanpwdennis

  • Add function for converting data to PLINK format
  • Fix variant_allele error in biallelic_snp_calls


@tristanpwdennis marked this pull request as draft March 26, 2024 11:02
@sanjaynagi (Collaborator)

Hey Tristan. Nice work!

I'll save comments for now, but FYI: when you add notebooks to malariagen_data, make sure you have cleared all outputs, otherwise they can become quite hefty and the repo balloons in size over time (all of it is stored in git history).
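(For reference, a minimal sketch of clearing outputs programmatically with the nbformat library; the notebook path is a placeholder.)

import nbformat

# Load the notebook, drop all code-cell outputs, then write it back in place.
path = "notebooks/example.ipynb"  # placeholder path
nb = nbformat.read(path, as_version=4)
for cell in nb.cells:
    if cell.cell_type == "code":
        cell.outputs = []
        cell.execution_count = None
nbformat.write(nb, path)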

@tristanpwdennis (Author)

I've found the source of the AssertionError (also see issue #516): something to do with how dask.array.map_blocks computes variant_allele at line 1629 of snp_data.py.

I haven't managed to get to the bottom of it yet, but this PR includes a temporary fix that just applies apply_allele_mapping to an in-memory NumPy array of variant_allele. I've also now added biallelic_snp_calls to to_plink.py instead of calling snp_calls and thinning the calls manually.
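(For reference, a minimal self-contained sketch of the in-memory workaround described above; the array contents are dummy values, and the fancy indexing stands in for what apply_allele_mapping is assumed to do.)

import numpy as np

# Dummy stand-ins for ds_bi["variant_allele"] and the allele mapping.
variant_allele = np.array(
    [[b"A", b"T", b"G", b"C"],
     [b"C", b"G", b"A", b"T"]]
)  # shape (variants, alleles)
allele_mapping = np.array([[0, 1], [0, 2]])  # ref + kept alt index per variant

# Apply the mapping in memory rather than blockwise via dask.array.map_blocks:
# pick, for each variant, the two alleles named by the mapping.
rows = np.arange(variant_allele.shape[0])[:, None]
variant_allele_out = variant_allele[rows, allele_mapping]
print(variant_allele_out)  # [[b'A' b'T'], [b'C' b'A']]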

@tristanpwdennis (Author)

I've (I hope!) made a fix for the above error (issue #516) and described it in more detail there.

@jonbrenas (Collaborator)

I think it should work. Feel free to mark it as "Ready for review" when you think it is appropriate.

@jonbrenas (Collaborator) left a comment

Hi @tristanpwdennis. Before we merge this PR, could I ask you to do some clean-up? test-1.ipynb needs to be removed, and you made some changes to .gitignore and test_snp_data.py. There are also quite a few print statements that need to be removed or switched to debug mode.
Could you also add a test that checks (at least) that the file is created?

# Filter SNPs for segregating sites only
with self._spinner("Subsetting to segregating sites"):
    gt = ds_snps["call_genotype"].data.compute()
print("count alleles")

Can you remove this print statement?

print("count alleles")
with ProgressBar():
ac = allel.GenotypeArray(gt).count_alleles(max_allele=3)
print("ascertain segregating sites")

Same thing here.

& (ac[:, 0] <= max_ref_ac)
& (an_missing <= max_missing_an)
)
print(f"ascertained {np.count_nonzero(loc_sites):,} sites")

Same thing here.

print(f"ascertained {np.count_nonzero(loc_sites):,} sites")

# Set up dataset with required vars for plink conversion
print("Set up dataset")

Same thing here.

@@ -88,6 +88,7 @@ def test_open_snp_sites(fixture, api: AnophelesSnpData):
    assert "variants" in contig_grp
    variants = contig_grp["variants"]
    assert "POS" in variants
    assert False

Not sure what this is here for.

@jonbrenas (Collaborator)

Thank you very much @tristanpwdennis. This is great. I added a few comments where there are still some print statements that need to be removed. There is also a .png file that should be removed. Please feel free to ask if you have a question or if some of the requested changes do not make sense to you.

@tristanpwdennis (Author)

Hi Jon,
Thanks! Will tidy further and add the test. Cheers :)

@tristanpwdennis (Author) commented Sep 19, 2024

Hi @jonbrenas, I had a tidy and removed some redundant code from to_plink.py. I also added a test (test_plink_converter.py) to make sure the files are created. Let me know how everything looks & if this is sufficient, or if I can add any more tests. Hope this works ok!
Thanks
-t

@jonbrenas (Collaborator)

Great job @tristanpwdennis !

I may be misunderstanding the way the PLINK format works, but shouldn't it work with any biallelic site, and not only those where the alleles are the ref and the first alt? Is there a reason why you select only those?

Also, bed_reader needs to be added to the list of packages installed before it is imported. This can be done by modifying the pyproject.toml file.
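(For reference, the dependency declaration would look something like this in pyproject.toml, assuming the project uses Poetry-style dependency tables; the version constraint is illustrative.)

[tool.poetry.dependencies]
bed-reader = "^1.0"  # illustrative version constraint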

"sample_id",
"call_genotype",
]
].isel(alleles=slice(0, 2)) # .sel(variants=ix_sites_thinned)

The comment can probably be deleted.

if os.path.exists(f"{file_path}.bim"):
    pass
if os.path.exists(f"{file_path}.fam"):
    pass

These should probably be assert os.path.exists(...). I don't think these tests could fail even if the function didn't create the correct files.
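(A minimal sketch of what the suggested assertions might look like; file_path here is a placeholder for the prefix returned by the converter under test.)

import os

file_path = "output/test_plink"  # placeholder prefix produced by the converter

# Fail the test if any of the expected PLINK outputs was not written.
for ext in (".bed", ".bim", ".fam"):
    assert os.path.exists(f"{file_path}{ext}"), f"missing PLINK file: {file_path}{ext}"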

@jonbrenas (Collaborator) commented Sep 20, 2024

It might also be a good idea to have a check that the data in the files is correct for "dummy" data. For example, looking at only the first 5 samples of 'AG1000G-AO' and the region 'X:10_000_000-10_000_500', I find only 3 biallelic sites. With such small data, it should be fairly simple to figure out what the PLINK files should be and check that they are generated correctly. You could, obviously, use any other "dummy" data that you like.
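(A rough sketch of such a round-trip check using bed_reader; the path is a placeholder and the expected shape of 5 samples by 3 biallelic sites follows the example above.)

from bed_reader import open_bed

# Read back the exported PLINK files and check them against the tiny example.
with open_bed("dummy_output.bed") as bed:  # placeholder path
    genotypes = bed.read()  # (samples, variants) array of 0/1/2 genotype codes
    assert genotypes.shape == (5, 3)  # 5 samples, 3 biallelic sites
    assert all(chrom == "X" for chrom in bed.chromosome)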

… check that dimensions and content of exported dummy data are correct
@tristanpwdennis (Author)

Hi Jon

Good spot! Selecting the REF and 1st ALT is a relic from before I used .biallelic_snp_calls and had to manually modify the alleles dimension. It didn't change anything by being there (i.e. it was selecting the first two alleles of a dataset that already had only two alleles), but it's redundant now and I've removed it.

I've updated the tests to include the feedback above, making sure that the exported dummy data match the data pre-export, both in terms of dimensions and actual content. Please take a look and let me know if there's anything else you think I can add.

Apologies also for the long time taken and revisions needed - you can probably tell that I am quite new to this. Thanks for your feedback and help!

-t

@jonbrenas (Collaborator) commented Oct 10, 2024

Thank you very much, @tristanpwdennis. It passes the eye test, so I merged it into a shadow PR to see whether it passes the various tests. It hasn't, but I haven't checked yet whether something big is going wrong or whether it's the result of small conflicts with recent code. I'll come back to you if there is something significant in the code that needs to be addressed.

@jonbrenas (Collaborator)

Hi @tristanpwdennis. I think you forgot to add malariagen_data/anoph/base_params.py to this PR.

@jonbrenas previously approved these changes Oct 10, 2024
@jonbrenas dismissed their stale review October 10, 2024 10:06

Not what I wanted to do!

@tristanpwdennis (Author) commented Oct 15, 2024

Hi Jon, apologies, I think I'm missing something: I can see base_params.py in the anoph directory both on GitHub and in my local copy?

@jonbrenas (Collaborator)

Hi @tristanpwdennis. Sorry, I had misinterpreted the error. base_params.chunks_default has been made obsolete in the main repository since you created your branch, and I thought it was a parameter that you had added. I have corrected it. I will come back to you if there is anything else that needs your input. Thank you for the hard work!

Comment on lines 1672 to 1688
- variant_allele_dask = ds_bi["variant_allele"].data
- variant_allele_out = dask_apply_allele_mapping(
-     variant_allele_dask, allele_mapping, max_allele=1
- )
+ variant_allele = ds_bi["variant_allele"].data
+ variant_allele = variant_allele.rechunk((variant_allele.chunks[0], -1))
+
+ # Chunk allele mapping according to same variant_allele.
+ allele_mapping_chunked = da.from_array(
+     allele_mapping, chunks=variant_allele.chunks
+ )
+
+ # Apply allele mapping blockwise to variant_allele.
+ variant_allele_out = da.map_blocks(
+     lambda allele, map: apply_allele_mapping(allele, map, max_allele=1),
+     variant_allele,
+     allele_mapping_chunked,
+     dtype=variant_allele.dtype,
+     chunks=(variant_allele.chunks[0], [2]),
+ )
@alimanfoo (Member)

Hi @tristanpwdennis, these changes are no longer necessary as the bug has been fixed in master. The helper function dask_apply_allele_mapping() handles the necessary transformation.

My git-fu is not perfect but there should be a simple way to revert these changes so snp_data.py matches origin/master. Perhaps something like git checkout origin/master -- malariagen_data/anoph/snp_data.py might work?

@tristanpwdennis (Author) commented Oct 17, 2024

Hi @alimanfoo - sorry - I'm sure you and @jonbrenas can tell I've been struggling a bit here (lack of git-fu, though that website has been helpful). This didn't seem to work as far as I can tell, so any suggestions are welcome, thanks! I'll keep looking...

@jonbrenas (Collaborator)

Hi @tristanpwdennis, @alimanfoo. I am no expert at git-fu either, but I think the easiest solution is to make the change in the shadow PR (where all tests were successful except for the new notebook, which uses a local path). As a matter of fact, I did it without any real issue. I would say that at this point you have accomplished the job, @tristanpwdennis (congrats and thank you, by the way), and all that is left is some light clean-up. @alimanfoo, do you object to considering this PR a success and closing it (with the option to do more fine-tuning in the shadow PR if needed)?

@alimanfoo (Member)

Thanks so much @jonbrenas and @tristanpwdennis 🙏🏻. Having a quick look over this PR, there are a couple of small things I noticed; probably worth iterating a bit more here before calling it done. I'll add some comments tomorrow.

@tristanpwdennis force-pushed the plink-converter-2024-03-26 branch 2 times, most recently from 63a64cc to 6335208 on October 18, 2024 08:27
@alimanfoo (Member) left a comment

Looks awesome! A few minor suggestions...

Comment on lines +65 to +66
if os.path.exists(bed_file_path):
    return plink_file_path

Slightly concerned here that if a user changes their mind about something, like which sample sets to use, and then reruns this function, the new file will not be written.


Possibly consider adding an overwrite parameter which is True by default, but could be set to False to avoid recomputation.
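(A hypothetical sketch of that guard; the function and helper names here are placeholders, not the PR's actual API.)

import os

def _plink_outputs_exist(prefix: str) -> bool:
    # True only if all three PLINK files are already present.
    return all(os.path.exists(f"{prefix}{ext}") for ext in (".bed", ".bim", ".fam"))

def to_plink_sketch(prefix: str, overwrite: bool = True) -> str:
    # With overwrite=False, reuse existing files and skip recomputation.
    if not overwrite and _plink_outputs_exist(prefix):
        return prefix
    # ... otherwise (re)compute and write the .bed/.bim/.fam files ...
    return prefix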

Comment on lines +84 to +93
# Set up dataset with required vars for plink conversion
ds_snps_asc = ds_snps[
    [
        "variant_contig",
        "variant_position",
        "variant_allele",
        "sample_id",
        "call_genotype",
    ]
]

Is this necessary? Could just access ds_snps directly.

Comment on lines +96 to +100
with self._spinner("Computing genotype ref counts"):
    gt_asc = ds_snps_asc["call_genotype"].data.compute()
    gn_ref = allel.GenotypeDaskArray(gt_asc).to_n_ref(fill=-127)
    with ProgressBar():
        gn_ref = gn_ref.compute()

Suggested change:
- with self._spinner("Computing genotype ref counts"):
-     gt_asc = ds_snps_asc["call_genotype"].data.compute()
-     gn_ref = allel.GenotypeDaskArray(gt_asc).to_n_ref(fill=-127)
-     with ProgressBar():
-         gn_ref = gn_ref.compute()
+ with self._dask_progress("Computing genotype ref counts"):
+     gt_asc = ds_snps_asc["call_genotype"].data  # dask array
+     gn_ref = allel.GenotypeDaskArray(gt_asc).to_n_ref(fill=-127)
+     gn_ref = gn_ref.compute()

Comment on lines +105 to +109
# Load final data
with ProgressBar():
    ds_snps_final = ds_snps_asc[
        ["variant_contig", "variant_position", "variant_allele", "sample_id"]
    ].isel(variants=loc_var)

Suggested change:
- # Load final data
- with ProgressBar():
-     ds_snps_final = ds_snps_asc[
-         ["variant_contig", "variant_position", "variant_allele", "sample_id"]
-     ].isel(variants=loc_var)
+ # Load final data
+ ds_snps_final = dask_compress_dataset(ds_snps_asc, loc_var, dim="variants")


The function dask_compress_dataset() can be imported from the util module, and provides an optimised implementation of selecting from a dataset using a boolean indexer.

Comment on lines +114 to +121
alleles = ds_snps_final["variant_allele"].values
properties = {
    "iid": ds_snps_final["sample_id"].values,
    "chromosome": ds_snps_final["variant_contig"].values,
    "bp_position": ds_snps_final["variant_position"].values,
    "allele_1": alleles[:, 0],
    "allele_2": alleles[:, 1],
}

Could consider using a spinner...

Suggested change:
- alleles = ds_snps_final["variant_allele"].values
- properties = {
-     "iid": ds_snps_final["sample_id"].values,
-     "chromosome": ds_snps_final["variant_contig"].values,
-     "bp_position": ds_snps_final["variant_position"].values,
-     "allele_1": alleles[:, 0],
-     "allele_2": alleles[:, 1],
- }
+ with self._spinner("Prepare output data"):
+     alleles = ds_snps_final["variant_allele"].values
+     properties = {
+         "iid": ds_snps_final["sample_id"].values,
+         "chromosome": ds_snps_final["variant_contig"].values,
+         "bp_position": ds_snps_final["variant_position"].values,
+         "allele_1": alleles[:, 0],
+         "allele_2": alleles[:, 1],
+     }

count_A1=True,
)

print(f"PLINK files written to to: {plink_file_path}")

Remove print statement.


def biallelic_snps_to_plink(
    self,
    results_dir,

Nit, consider output_dir?

Comment on lines +144 to +145
    min_minor_ac: Optional[base_params.min_minor_ac] = 0,
    max_missing_an: Optional[base_params.max_missing_an] = 0,

Consider using the same defaults as PCA and NJT here?

Comment on lines +125 to +126
# Test to see if sample_id is exported correctly (stored in the .fam file).
assert set(bed.iid) == set(ds_test.sample_id.values)

Suggested change:
- # Test to see if sample_id is exported correctly (stored in the .fam file).
- assert set(bed.iid) == set(ds_test.sample_id.values)
+ # Test to see if sample_id is exported correctly (stored in the .fam file).
+ assert_array_equal(bed.iid, ds_test.sample_id.values)

Comment on lines +116 to +119
ds_test = api.biallelic_snp_calls(
    **data_params,
    n_snps=n_snps,
)

It is possible that the dataset could be slightly different from what is written out by the plink converter, because the plink converter also checks for and removes any rows with all identical genotype calls. But we may not encounter that with the test dataset, so perhaps OK to ignore.
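(For illustration, a minimal sketch of dropping non-segregating rows with scikit-allel on dummy genotype calls; the converter's exact filter for identical calls may differ.)

import numpy as np
import allel

# Dummy genotype calls: (variants, samples, ploidy).
gt = allel.GenotypeArray(np.random.randint(0, 2, size=(100, 5, 2)))

# Keep only segregating sites, i.e. drop rows where only one allele is observed.
ac = gt.count_alleles()
loc_seg = ac.is_segregating()
gt_seg = gt.compress(loc_seg, axis=0)
print(gt_seg.shape)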
