Add plink converter function #515
base: master
Conversation
tristanpwdennis commented Mar 26, 2024
- Add function for converting data to PLINK format
- fix variant_allele error in biallelic_snp_calls
Check out this pull request on ReviewNB: see visual diffs & provide feedback on Jupyter Notebooks.
Hey Tristan. Nice work! I'll save comments for now but FYI - when you add notebooks to malariagen_data, make sure you have cleared all outputs, otherwise they can become quite hefty in size and then the repo balloons in size over time (all of it is stored in git history).
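For anyone following along, outputs can be cleared from the Jupyter UI or programmatically; here is a minimal sketch using nbformat (the notebook path below is just a placeholder):

import nbformat

# Placeholder path to the notebook being added in this PR.
path = "notebooks/plink_converter_example.ipynb"
nb = nbformat.read(path, as_version=4)
for cell in nb.cells:
    if cell.cell_type == "code":
        # Drop outputs and execution counts so the committed file stays small.
        cell.outputs = []
        cell.execution_count = None
nbformat.write(nb, path)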
…s, avoids using dask
…variant_allele mapping
I've found the source of the AssertionError (also see issue #516) - something to do with the allele mapping. I haven't managed to get to the bottom of it yet, but in this PR there's a temporary fix that just applies apply_allele_mapping without the numba decorator.
Removed numba decorator for the apply_allele_mapping function, for now
I've (I hope!) made a fix to the above error (issue #516), and described it in more detail there.
I think it should work. Feel free to mark it as "Ready for review" when you think that it is appropriate.
Hi @tristanpwdennis. Before we merge this PR, could I ask you to do some clean-up? test-1.ipynb needs to be removed, and you made some changes to .gitignore and test_snp_data.py. There are also quite a few print commands that need to be removed or switched to debug mode. Could you also add a test that checks (at least) that the file is created?
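As a rough illustration of such a test, here is a minimal sketch; the region, sample set and parameter values are placeholders, and the call just follows the converter function added in this PR:

import os

def test_plink_files_created(fixture, api, tmp_path):
    # Run the converter on a small slice of data and capture the output prefix.
    file_path = api.biallelic_snps_to_plink(
        results_dir=str(tmp_path),
        region="3L",
        sample_sets=["AG1000G-AO"],
        n_snps=10,
    )
    # The three PLINK files should all have been written.
    for ext in (".bed", ".bim", ".fam"):
        assert os.path.exists(f"{file_path}{ext}")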
malariagen_data/anoph/to_plink.py
Outdated
# Filter SNPs for segregating sites only
with self._spinner("Subsetting to segregating sites"):
    gt = ds_snps["call_genotype"].data.compute()
print("count alleles")
Can you remove this print statement?
malariagen_data/anoph/to_plink.py
Outdated
print("count alleles")
with ProgressBar():
    ac = allel.GenotypeArray(gt).count_alleles(max_allele=3)
print("ascertain segregating sites")
Same thing here.
malariagen_data/anoph/to_plink.py
Outdated
    & (ac[:, 0] <= max_ref_ac)
    & (an_missing <= max_missing_an)
)
print(f"ascertained {np.count_nonzero(loc_sites):,} sites")
Same thing here.
malariagen_data/anoph/to_plink.py
Outdated
print(f"ascertained {np.count_nonzero(loc_sites):,} sites")

# Set up dataset with required vars for plink conversion
print("Set up dataset")
Same thing here.
tests/anoph/test_snp_data.py
Outdated
@@ -88,6 +88,7 @@ def test_open_snp_sites(fixture, api: AnophelesSnpData):
     assert "variants" in contig_grp
     variants = contig_grp["variants"]
     assert "POS" in variants
+    assert False
Not sure what this is here for.
Thank you very much @tristanpwdennis. This is great. I added a few comments where there are still some print statements.
Hi @jonbrenas, I had a tidy and removed some redundant code from to_plink.py. I also added a test (test_plink_converter.py) to make sure the files are created. Let me know how everything looks & if this is sufficient, or if I can add any more tests. Hope this works ok!
Great job @tristanpwdennis! I may be misunderstanding the way the PLINK format works, but shouldn't it work with any biallelic site and not only with those where the alleles are ref and 1st alt? Is there a reason why you select only those? Also, bed_reader needs to be added to the list of packages installed before it is imported. This can be done by modifying the project's dependency configuration.
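For context, a minimal sketch of how a PLINK fileset is written with bed_reader (the values, sample names and output path below are invented purely for illustration):

import numpy as np
from bed_reader import to_bed

# 2 samples x 3 variants of allele counts (0, 1 or 2 per genotype).
val = np.array([[0, 1, 2], [2, 2, 0]], dtype="float32")
properties = {
    "iid": np.array(["sample_1", "sample_2"]),
    "chromosome": np.array(["X", "X", "X"]),
    "bp_position": np.array([10_000_100, 10_000_200, 10_000_300]),
    "allele_1": np.array(["A", "C", "G"]),
    "allele_2": np.array(["T", "G", "A"]),
}
# Writes tiny.bed plus the accompanying tiny.bim and tiny.fam.
to_bed("tiny.bed", val, properties=properties, count_A1=True)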
malariagen_data/anoph/to_plink.py
Outdated
        "sample_id",
        "call_genotype",
    ]
].isel(alleles=slice(0, 2))  # .sel(variants=ix_sites_thinned)
The comment can probably be deleted.
tests/anoph/test_plink_converter.py
Outdated
if os.path.exists(f"{file_path}.bim"):
    pass
if os.path.exists(f"{file_path}.fam"):
    pass
These should probably be assert os.path.exists(...). I don't think these tests could fail even if the function didn't create the correct files.
It might also be a good idea to have a check that the data in the files is correct for "dummy" data. For example, looking at only the first 5 samples of 'AG1000G-AO' and the region 'X:10_000_000-10_000_500', I find only 3 biallelic sites. With such small data, it should be fairly simple to figure out what the PLINK files should be and check that they are generated correctly. You could, obviously, use any other "dummy" data that you like.
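As a rough sketch of what such a content check could look like (file_path and ds_test are assumed to come from the test setup discussed above):

import os
import numpy as np
from bed_reader import open_bed

# All three PLINK files should exist.
for ext in (".bed", ".bim", ".fam"):
    assert os.path.exists(f"{file_path}{ext}")

with open_bed(f"{file_path}.bed") as bed:
    # Dimensions should match the pre-export dataset (assuming no sites were
    # dropped by the converter's filters).
    assert bed.iid.shape[0] == ds_test.sizes["samples"]
    assert bed.bp_position.shape[0] == ds_test.sizes["variants"]
    # Variant positions should round-trip via the .bim file.
    np.testing.assert_array_equal(bed.bp_position, ds_test["variant_position"].values)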
… check that dimensions and content of exported dummy data are correct
Hi Jon. Good spot! Selecting the REF and 1st ALT is a relic from before I used .biallelic_snp_calls and had to manually modify the alleles dimension. It didn't change anything by being there (e.g. it was selecting the first two alleles of a dataset that already only had two alleles), but it's redundant now and I've removed it. I've updated the tests to include the feedback above, making sure that the exported dummy data match the data pre-export, both in terms of dimensions and actual content. Please take a look and let me know if there's anything else you think I can add. Apologies also for the long time taken and revisions needed - you can probably tell that I am quite new to this. Thanks for your feedback and help! -t
Thank you very much, @tristanpwdennis. It passes the eye-test, so I merged it into a shadow PR to see whether it passes the various tests. It hasn't, but I haven't checked yet whether there is something big going wrong or if it's the result of small conflicts with recent code. I'll come back to you if there is something significant in the code that needs to be addressed.
Hi @tristanpwdennis. I think you forgot to add base_params.py.
Hi Jon, apologies, I think I'm missing something - I can see base_params.py in the anoph directory, both on GH and in my local?
Hi @tristanpwdennis. Sorry, I had misinterpreted the error.
malariagen_data/anoph/snp_data.py
Outdated
variant_allele_dask = ds_bi["variant_allele"].data
variant_allele_out = dask_apply_allele_mapping(
    variant_allele_dask, allele_mapping, max_allele=1
)

variant_allele = ds_bi["variant_allele"].data
variant_allele = variant_allele.rechunk((variant_allele.chunks[0], -1))

# Chunk allele mapping according to same variant_allele.
allele_mapping_chunked = da.from_array(
    allele_mapping, chunks=variant_allele.chunks
)

# Apply allele mapping blockwise to variant_allele.
variant_allele_out = da.map_blocks(
    lambda allele, map: apply_allele_mapping(allele, map, max_allele=1),
    variant_allele,
    allele_mapping_chunked,
    dtype=variant_allele.dtype,
    chunks=(variant_allele.chunks[0], [2]),
)
Hi @tristanpwdennis, these changes are no longer necessary as the bug has been fixed in master. The helper function dask_apply_allele_mapping() handles the necessary transformation.
My git-fu is not perfect, but there should be a simple way to revert these changes so snp_data.py matches origin/master. Perhaps something like git checkout origin/master -- malariagen_data/anoph/snp_data.py might work?
Hi @alimanfoo - sorry - I'm sure you and @jonbrenas can tell I've been struggling a bit here (lack of git-fu, though that website has been helpful). This didn't seem to work as far as I can tell - so any suggestions welcome, thanks! I'll keep looking...
Hi @tristanpwdennis, @alimanfoo, I am no expert at git-fu either but I think the easiest solution is to make the change in the shadow PR (where all tests were successful except for the new notebook that uses a local path). As a matter of fact, I did it without any real issue. I would say that, at this point, you have accomplished the job @tristanpwdennis (congrats and thank you, by the way) and all that is left is some light clean-up. @alimanfoo, do you object to considering this PR as a success and closing it (with the option to do more fine-tuning if needed in the shadow PR)?
Thanks so much @jonbrenas and @tristanpwdennis 🙏🏻. Having a quick look over this PR there's a couple of small things I noticed, probably worth iterating a bit more here before calling it done. I'll add some comments tomorrow.
63a64cc to 6335208
Looks awesome! A few minor suggestions...
if os.path.exists(bed_file_path):
    return plink_file_path
Slightly concerned here that if a user changes their mind about something, like what sample sets to use, then reruns this function, the new file will not be written.
Possibly consider adding an overwrite parameter which is True by default, but could be set to False to avoid recomputation.
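A possible shape for that guard, sketched against the existing check (the overwrite parameter is only a suggestion here, not something in the current code):

# Assuming `overwrite: bool = True` has been added to the function signature,
# existing output is only reused when the caller opts out of rewriting.
if os.path.exists(bed_file_path) and not overwrite:
    return plink_file_path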
# Set up dataset with required vars for plink conversion
ds_snps_asc = ds_snps[
    [
        "variant_contig",
        "variant_position",
        "variant_allele",
        "sample_id",
        "call_genotype",
    ]
]
Is this necessary? Could just access ds_snps directly.
with self._spinner("Computing genotype ref counts"):
    gt_asc = ds_snps_asc["call_genotype"].data.compute()
gn_ref = allel.GenotypeDaskArray(gt_asc).to_n_ref(fill=-127)
with ProgressBar():
    gn_ref = gn_ref.compute()
Suggested change:
- with self._spinner("Computing genotype ref counts"):
-     gt_asc = ds_snps_asc["call_genotype"].data.compute()
- gn_ref = allel.GenotypeDaskArray(gt_asc).to_n_ref(fill=-127)
- with ProgressBar():
-     gn_ref = gn_ref.compute()
+ with self._dask_progress("Computing genotype ref counts"):
+     gt_asc = ds_snps_asc["call_genotype"].data  # dask array
+     gn_ref = allel.GenotypeDaskArray(gt_asc).to_n_ref(fill=-127)
+     gn_ref = gn_ref.compute()
# Load final data
with ProgressBar():
    ds_snps_final = ds_snps_asc[
        ["variant_contig", "variant_position", "variant_allele", "sample_id"]
    ].isel(variants=loc_var)
Suggested change:
- # Load final data
- with ProgressBar():
-     ds_snps_final = ds_snps_asc[
-         ["variant_contig", "variant_position", "variant_allele", "sample_id"]
-     ].isel(variants=loc_var)
+ # Load final data
+ ds_snps_final = dask_compress_dataset(ds_snps_asc, loc_var, dim="variants")
The function dask_compress_dataset() can be imported from the util module, and provides an optimised implementation of selecting from a dataset using a boolean indexer.
alleles = ds_snps_final["variant_allele"].values
properties = {
    "iid": ds_snps_final["sample_id"].values,
    "chromosome": ds_snps_final["variant_contig"].values,
    "bp_position": ds_snps_final["variant_position"].values,
    "allele_1": alleles[:, 0],
    "allele_2": alleles[:, 1],
}
Could consider using a spinner...
Suggested change:
- alleles = ds_snps_final["variant_allele"].values
- properties = {
-     "iid": ds_snps_final["sample_id"].values,
-     "chromosome": ds_snps_final["variant_contig"].values,
-     "bp_position": ds_snps_final["variant_position"].values,
-     "allele_1": alleles[:, 0],
-     "allele_2": alleles[:, 1],
- }
+ with self._spinner("Prepare output data"):
+     alleles = ds_snps_final["variant_allele"].values
+     properties = {
+         "iid": ds_snps_final["sample_id"].values,
+         "chromosome": ds_snps_final["variant_contig"].values,
+         "bp_position": ds_snps_final["variant_position"].values,
+         "allele_1": alleles[:, 0],
+         "allele_2": alleles[:, 1],
+     }
    count_A1=True,
)

print(f"PLINK files written to to: {plink_file_path}")
Remove print statement.

def biallelic_snps_to_plink(
    self,
    results_dir,
Nit, consider output_dir?
min_minor_ac: Optional[base_params.min_minor_ac] = 0,
max_missing_an: Optional[base_params.max_missing_an] = 0,
Consider using same defaults as PCA and NJT here?
# Test to see if sample_id is exported correctly (stored in the .fam file).
assert set(bed.iid) == set(ds_test.sample_id.values)
Suggested change:
- # Test to see if sample_id is exported correctly (stored in the .fam file).
- assert set(bed.iid) == set(ds_test.sample_id.values)
+ # Test to see if sample_id is exported correctly (stored in the .fam file).
+ assert_array_equal(bed.iid, ds_test.sample_id.values)
ds_test = api.biallelic_snp_calls(
    **data_params,
    n_snps=n_snps,
)
It is possible that the dataset could be slightly different from what is written out by the plink converter, because the plink converter also checks for and removes any rows with all identical genotype calls. But we may not encounter that with the test dataset, so perhaps OK to ignore.
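Purely as a hypothetical illustration, the test could mirror that behaviour by dropping such sites from the comparison dataset before checking against the PLINK output (ds_test is reused from the test above; nothing below is in the current code):

import numpy as np

# A site is dropped if every sample carries an identical genotype call.
gt = ds_test["call_genotype"].values  # shape: (variants, samples, ploidy)
identical = (gt == gt[:, :1, :]).all(axis=(1, 2))
ds_test_cmp = ds_test.isel(variants=np.flatnonzero(~identical))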