-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding --seed
flag to customize the seed
when downsampling
#29
Conversation
seed
to config dictionary for snakemakeseed
flag to customize the seed
when downsampling
seed
flag to customize the seed
when downsampling--seed
flag to customize the seed
when downsampling
yeat/tests/test_cli.py
Outdated
df = pd.read_csv(quast_report, sep="\t") | ||
assert 61 <= df.iloc[12]["sample_contigs"] <= 91 # 76 +-20% of avg num_contigs | ||
assert 4183 <= df.iloc[13]["sample_contigs"] <= 6273 # 5228 +-20% of avg largest_contig | ||
assert 59515 <= df.iloc[14]["sample_contigs"] <= 89271 # 74393 +-20% of avg total_len |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is my take on this suggestion.
I was pretty liberal on my +- buffer range to catch the randomness from randint()
when downsampling.
The way I determined my medium for each assert was:
- I ran the above list of arguments 5 times,
- took the average for
num_contigs
,largest_contig
, andtotal_len
and - calculated the buffer +- 20% caps.
Above the function, there is a decorator. When this function is executed with pytest, the function is called 3 times. Since, the seed is random by default, we do not need to specify the seed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! A couple comments.
I have no idea what information df.iloc[12]["sample_contigs"]
stores. There is a describing its contents at the end of the line, but comments have a habit of coming out of sync with the code they are intended to describe. Probably better for legibility and clarity to assign those values to descriptive variable names before the assertion tests.
I think the x <= var <= y
is a clear construction, but another you may consider uses pytest.approx
. I use this most frequently to test the value of floating point numbers, for which simple ==
equality tests often fail (even if you're looking for an "exact" value, you have to specify some level of tolerance). But you can apply the same idea here, and just specify a wide tolerance. The first line would then become something like this, which is a pretty clear representation of 76 +/- 15.
assert num_contigs == pytest.approx(76, abs=15)
environment.yml
Outdated
- fastp>=0.23 | ||
- fastqc>=0.11 | ||
- gzip>=1.7 | ||
- mash>=2.3 | ||
- megahit>=1.2 | ||
- pytest-cov>=3.0 | ||
- python>=3.9 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
YEAT cannot install if the user's python version is < 3.9. Added this to allow users to upgrade if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might want to add or update an entry in the change log describing why only Python >=3.9 is supported now.
Okay, code is ready for review! Let me know if you have any questions or concerns. Thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. See my comments below.
environment.yml
Outdated
- fastp>=0.23 | ||
- fastqc>=0.11 | ||
- gzip>=1.7 | ||
- mash>=2.3 | ||
- megahit>=1.2 | ||
- pytest-cov>=3.0 | ||
- python>=3.9 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might want to add or update an entry in the change log describing why only Python >=3.9 is supported now.
yeat/tests/test_cli.py
Outdated
df = pd.read_csv(quast_report, sep="\t") | ||
assert 61 <= df.iloc[12]["sample_contigs"] <= 91 # 76 +-20% of avg num_contigs | ||
assert 4183 <= df.iloc[13]["sample_contigs"] <= 6273 # 5228 +-20% of avg largest_contig | ||
assert 59515 <= df.iloc[14]["sample_contigs"] <= 89271 # 74393 +-20% of avg total_len |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! A couple comments.
I have no idea what information df.iloc[12]["sample_contigs"]
stores. There is a describing its contents at the end of the line, but comments have a habit of coming out of sync with the code they are intended to describe. Probably better for legibility and clarity to assign those values to descriptive variable names before the assertion tests.
I think the x <= var <= y
is a clear construction, but another you may consider uses pytest.approx
. I use this most frequently to test the value of floating point numbers, for which simple ==
equality tests often fail (even if you're looking for an "exact" value, you have to specify some level of tolerance). But you can apply the same idea here, and just specify a wide tolerance. The first line would then become something like this, which is a pretty clear representation of 76 +/- 15.
assert num_contigs == pytest.approx(76, abs=15)
- fastp>=0.23 | ||
- fastqc>=0.11 | ||
- gzip>=1.7 | ||
- mash>=2.3 | ||
- megahit>=1.2 | ||
- pytest-cov>=3.0 | ||
- python=3.9 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Came across a very interesting situation with python version compatibilities with other packages.
I have enforced users to either to upgrade or downgrade to 3.9
. It is important that they do this because, in order for us to use the error_on_exit
parameter, we need at least 3.9
. Currently, the highest python version you can install with conda is 3.10
.
https://anaconda.org/anaconda/python
However, version 3.10
has incompatibility issues with all version of SPAdes unless you are on version 3.5.4
and above!
As of right now, the highest version that conda has available at this time is version 3.5.5
for linux and 3.5.2
for iOS. This is huge problem for iOS users because both SPAdes and Unicycler will fail if the users have python version 3.10
.
@@ -4,13 +4,14 @@ channels: | |||
- bioconda | |||
- defaults | |||
dependencies: | |||
- black=21.10b0 | |||
- black=22.10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Black version 21.10b0
has package incompatibilities errors with newer versions of click
. If a user has click version >8.1
, Black will crash with:
ImportError: cannot import name '_unicodefun' from 'click'
To fix this, users will need to downgrade click down to 8.0
.
This problem has been fixed in Black 22.3
and up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't much matter which version of Black is used, as long as it's used consistently. So you're welcome to upgrade and pin a newer version that doesn't have these issues. But that's often best left to a dedicated thread, since it can result in numerous trivial formatting changes that add a lot of noise and clutter to an existing PR.
Okay! A couple of comments on version pinning, added suggestions (Thanks!), and updated change log. Everything is ready to go! |
The purpose of this PR is to resolve #27 by adding a
--seed
flag to ensure that the original testtest_custom_downsample_input()
could reproduce the same number of contigs, config lengths, and total length.Other additions to this PR include:
README.md
,>v3.9
because of theexit_on_error
parameter that was introduced in Improved auto downsampling with custom coverage and avg read length #23 as well.