Adding `--seed` flag to customize the `seed` when downsampling #29

danejo3 · 2022-11-02T17:31:24Z

This PR resolves #27 by adding a `--seed` flag so that the original test, test_custom_downsample_input(), can reproduce the same number of contigs, contig lengths, and total length.

Other additions to this PR include:

- a small update to README.md,
- an additional test that checks for an acceptable range of expected values, needed because of the random seed introduced in Improved auto downsampling with custom coverage and avg read length #23, and
- requiring Python >=3.9 because of the exit_on_error parameter, also introduced in Improved auto downsampling with custom coverage and avg read length #23.
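
As a rough illustration of the idea (not YEAT's actual CLI code), a `--seed` option might be wired up with argparse along these lines. The program name, the `None` default, the help text, and the `resolve_seed` and `pick_reads` helpers are all hypothetical; `exit_on_error` is the argparse parameter that was added in Python 3.9.

```python
import argparse
import random


def build_parser():
    # exit_on_error=False (Python >= 3.9) lets argparse raise ArgumentError
    # for a bad value instead of calling sys.exit(), which is easier to test.
    parser = argparse.ArgumentParser(prog="yeat-sketch", exit_on_error=False)
    parser.add_argument(
        "--seed",
        type=int,
        default=None,  # hypothetical default: None means "pick a random seed"
        help="seed for the random number generator used when downsampling",
    )
    return parser


def resolve_seed(seed):
    # When no seed is given, draw one so it can still be reported and reused.
    return random.randint(1, 2**16 - 1) if seed is None else seed


def pick_reads(num_reads, sample_size, seed):
    # Hypothetical stand-in for downsampling: choose read indices reproducibly.
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_reads), sample_size))


args = build_parser().parse_args(["--seed", "42"])
seed = resolve_seed(args.seed)
first = pick_reads(1000, 5, seed)
second = pick_reads(1000, 5, seed)
print(first == second)  # same seed, same selection
```

With a fixed seed the two selections are identical, which is what allows a test to assert exact values; without one, each run draws a different seed and the results vary.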

yeat/Snakefile

danejo3 · 2022-11-03T17:48:49Z

yeat/tests/test_cli.py

+    df = pd.read_csv(quast_report, sep="\t")
+    assert 61 <= df.iloc[12]["sample_contigs"] <= 91  # 76 +-20% of avg num_contigs
+    assert 4183 <= df.iloc[13]["sample_contigs"] <= 6273  # 5228 +-20% of avg largest_contig
+    assert 59515 <= df.iloc[14]["sample_contigs"] <= 89271  # 74393 +-20% of avg total_len


#27 (comment)

This is my take on this suggestion.

I was fairly liberal with my +/- buffer range to account for the randomness from randint() when downsampling.

The way I determined the midpoint for each assert was:

1. I ran the above list of arguments 5 times,
2. took the average for num_contigs, largest_contig, and total_len, and
3. calculated the +/- 20% bounds.

There is a decorator above the function. When this function is executed with pytest, it is called 3 times. Since the seed is random by default, we do not need to specify the seed.
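
For reference, the bounds in the asserts can be reproduced from the reported averages with a few lines of Python. This is only a sketch of the arithmetic; the `tolerance_bounds` helper is a hypothetical name, and rounding the lower bound up and the upper bound down is an assumption that happens to match the numbers in the diff.

```python
def tolerance_bounds(center, percent=20):
    # Integer arithmetic avoids floating-point surprises at the boundaries.
    # The lower bound is rounded up and the upper bound rounded down, so
    # both limits stay inside the +/- percent window.
    lower = -((-center * (100 - percent)) // 100)
    upper = (center * (100 + percent)) // 100
    return lower, upper


# Averages reported in the comment above.
averages = {"num_contigs": 76, "largest_contig": 5228, "total_len": 74393}

for name, avg in averages.items():
    low, high = tolerance_bounds(avg)
    print(f"{name}: {low} <= value <= {high}")
# num_contigs: 61 <= value <= 91
# largest_contig: 4183 <= value <= 6273
# total_len: 59515 <= value <= 89271
```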

danejo3 · 2022-11-03T17:56:15Z

environment.yml

    - fastp>=0.23
    - fastqc>=0.11
    - gzip>=1.7
    - mash>=2.3
    - megahit>=1.2
    - pytest-cov>=3.0
+    - python>=3.9


YEAT cannot be installed if the user's Python version is < 3.9. I added this requirement so that users upgrade if needed.

danejo3 · 2022-11-03T18:01:50Z

Okay, code is ready for review! Let me know if you have any questions or concerns. Thanks

standage

LGTM. See my comments below.

standage · 2022-11-04T13:37:41Z

environment.yml

    - fastp>=0.23
    - fastqc>=0.11
    - gzip>=1.7
    - mash>=2.3
    - megahit>=1.2
    - pytest-cov>=3.0
+    - python>=3.9


Might want to add or update an entry in the change log describing why only Python >=3.9 is supported now.

yeat/cli.py

standage · 2022-11-04T13:59:22Z

yeat/tests/test_cli.py

+    df = pd.read_csv(quast_report, sep="\t")
+    assert 61 <= df.iloc[12]["sample_contigs"] <= 91  # 76 +-20% of avg num_contigs
+    assert 4183 <= df.iloc[13]["sample_contigs"] <= 6273  # 5228 +-20% of avg largest_contig
+    assert 59515 <= df.iloc[14]["sample_contigs"] <= 89271  # 74393 +-20% of avg total_len


Looks good! A couple comments.

I have no idea what information df.iloc[12]["sample_contigs"] stores. There is a comment describing its contents at the end of the line, but comments have a habit of coming out of sync with the code they are intended to describe. Probably better for legibility and clarity to assign those values to descriptive variable names before the assertion tests.

I think the x <= var <= y is a clear construction, but another option you may consider is pytest.approx. I use this most frequently to test the value of floating point numbers, for which simple == equality tests often fail (even if you're looking for an "exact" value, you have to specify some level of tolerance). But you can apply the same idea here, and just specify a wide tolerance. The first line would then become something like this, which is a pretty clear representation of 76 +/- 15.

assert num_contigs == pytest.approx(76, abs=15)
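
One way to act on the first suggestion is to look rows up by their metric label rather than by position. The sketch below uses only the standard library and an inline, made-up report. It assumes the report is tab-separated with an `Assembly` label column and a `sample_contigs` value column, and that the rows are labeled `# contigs`, `Largest contig`, and `Total length`; those labels should be checked against the real report. The `load_metrics` helper is a hypothetical name.

```python
import csv
import io
import math

# Hypothetical excerpt of a tab-separated report; the values are illustrative.
REPORT_TSV = (
    "Assembly\tsample_contigs\n"
    "# contigs\t76\n"
    "Largest contig\t5228\n"
    "Total length\t74393\n"
)


def load_metrics(handle, column):
    # Map each row label to its integer value in the chosen column.
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Assembly"]: int(row[column]) for row in reader}


metrics = load_metrics(io.StringIO(REPORT_TSV), "sample_contigs")
num_contigs = metrics["# contigs"]
largest_contig = metrics["Largest contig"]
total_len = metrics["Total length"]

# math.isclose with abs_tol is a standard-library analogue of
# pytest.approx(76, abs=15).
assert math.isclose(num_contigs, 76, abs_tol=15)
assert 4183 <= largest_contig <= 6273
assert 59515 <= total_len <= 89271
```

With pandas, a similar lookup should be possible by reading the file with `index_col=0` and selecting values with `df.loc[label, "sample_contigs"]`, which keeps the assertions readable without a trailing comment.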

danejo3 · 2022-11-04T16:13:28Z

environment.yml

    - fastp>=0.23
    - fastqc>=0.11
    - gzip>=1.7
    - mash>=2.3
    - megahit>=1.2
    - pytest-cov>=3.0
+    - python=3.9


I came across an interesting Python version compatibility issue with other packages.

I have pinned Python to 3.9, so users will need to either upgrade or downgrade to that version. This matters because the exit_on_error parameter requires at least 3.9. Currently, the highest Python version you can install with conda is 3.10.

https://anaconda.org/anaconda/python

However, Python 3.10 is incompatible with every version of SPAdes below 3.15.4!

ablab/spades#863

As of right now, the highest SPAdes version available through conda is 3.15.5 for Linux and 3.15.2 for macOS. This is a huge problem for macOS users because both SPAdes and Unicycler will fail if the user has Python 3.10.

https://anaconda.org/bioconda/spades
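
To make the constraint concrete, a runtime guard mirroring the `python=3.9` pin could look like the sketch below. This is not part of the PR; the `is_supported` helper and its `required` parameter are hypothetical names used for illustration.

```python
import sys


def is_supported(version_info=sys.version_info, required=(3, 9)):
    # Compare only major.minor, mirroring a conda pin such as python=3.9:
    # 3.8 lacks argparse's exit_on_error, and 3.10 breaks older SPAdes builds.
    return tuple(version_info[:2]) == required


for candidate in [(3, 8, 10), (3, 9, 13), (3, 10, 0)]:
    print(candidate, is_supported(candidate))
```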

danejo3 · 2022-11-04T16:20:37Z

environment.yml

@@ -4,13 +4,14 @@ channels:
    - bioconda
    - defaults
 dependencies:
-    - black=21.10b0
+    - black=22.10


Black version 21.10b0 is incompatible with newer versions of click. If a user has click version >=8.1, Black will crash with:

ImportError: cannot import name '_unicodefun' from 'click'

To work around this, users would need to downgrade click to 8.0.

This problem has been fixed in Black 22.3 and up.

psf/black#2964

It doesn't much matter which version of Black is used, as long as it's used consistently. So you're welcome to upgrade and pin a newer version that doesn't have these issues. But that's often best left to a dedicated thread, since it can result in numerous trivial formatting changes that add a lot of noise and clutter to an existing PR.

danejo3 · 2022-11-04T16:41:09Z

Okay! I left a couple of comments on version pinning, implemented the suggestions (thanks!), and updated the change log. Everything is ready to go!

first commit

484189c

small changes

1c85bd4

danejo3 changed the title ~~Adding seed to config dictionary for snakemake~~ Adding seed flag to customize the seed when downsampling Nov 2, 2022

danejo3 changed the title ~~Adding seed flag to customize the seed when downsampling~~ Adding --seed flag to customize the seed when downsampling Nov 2, 2022

danejo3 added 2 commits November 3, 2022 13:35

added tests

0d2d842

updated changelog

bffb135

cleaning up code

5c88ad8

standage requested changes Nov 4, 2022

implemented suggestions

e4a5e31

updating env.yml

4f1c230

danejo3 added 3 commits November 4, 2022 12:33

updated changedlog

85e676c

added comments

b0fb2e5

added comments

fb2d765

standage approved these changes Nov 4, 2022

standage merged commit b2cd761 into main Nov 4, 2022

standage deleted the fix-downsample-test branch November 4, 2022 18:03
