Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the assignment of multiple quantification window coordinates #38

Merged
merged 31 commits into from
Mar 28, 2024

Conversation

Colelyman
Copy link
Collaborator

This PR fixes the assignment of multiple quantification window coordinates (qwc) when provided with multiple amplicons.

Before this PR, if you provided multiple amplicons and specified multiple qwc's the qwc for the second amplicon would be in reference to the first amplicon. With this PR, the qwc for the second (or third, fourth, etc.) amplicon is not with respect to the first amplicon.

One remaining question @kclem before we merge this in, if multiple amplicons are provided, but only one qwc is specified then the qwc for the second amplicon is adjusted to be in reference to the first amplicon (via s1inds). Is this desired behavior?

mbowcut2 and others added 7 commits March 11, 2024 16:15
* refactor errors='ignore' to try except

* refactored integer slice to iloc[]

* moved to_numeric try except to function

* Refactor to_numeric_ignore_errors to to_numeric_ignore_columns

This change is slightly cleaner because it addresses the root issue that some
columns are strings (and can therefore not be converted to numeric types). Now
if an error does occur when converting the dfs to numeric types it won't be
swallowed up.

* Add documentation to to_numeric_ignore_columns

---------

Co-authored-by: Cole Lyman <[email protected]>
The rationale behind this is that the behavior around the cloned amplicon is
quite different than if the qwc are specified directly for the amplicon.
This function just computes the relative indexes without doing an alignment.
mbowcut2 and others added 6 commits March 18, 2024 17:45
…soWGS params (#42)

* Extract out split_interleaved_fastq function to CRISPRessoShared

* Implement splitting interleaved fastq files in CRISPRessoPooled

* Suppress split_interleaved_input from CRISPRessoWGS parameters

* Suppress other parameters in CRISPRessoWGS

* Move where interleaved fastq files are split to be trimmed properly
* - Fixed references to ref_names_for_pe

* removed extra tabs

* trying to match empty line, no tabs

* - changed references to ref_names[0]

* Mckay/pd warnings (#45)

* refactor errors='ignore' to try except

* refactored integer slice to iloc[]

* moved to_numeric try except to function

* Refactor to_numeric_ignore_errors to to_numeric_ignore_columns

This change is slightly cleaner because it addresses the root issue that some
columns are strings (and can therefore not be converted to numeric types). Now
if an error does occur when converting the dfs to numeric types it won't be
swallowed up.

* Add documentation to to_numeric_ignore_columns

---------

Co-authored-by: Cole Lyman <[email protected]>

---------

Co-authored-by: Cole Lyman <[email protected]>
@mbowcut2
Copy link

@Colelyman merged master in and this passes all integration tests. I think we're ready for review and merging. LMK if you want to go over test cases together, want more coverage, etc.

Copy link
Collaborator Author

@Colelyman Colelyman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbowcut2 I think this is definitely on the right track and close to being merged in, let me know if you have any questions or objections to the comments!

Comment on lines 193 to 194
# BUG: you can't include the last base in the QW or you get an out of range error -- this is extant beyond the function tested here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be addressed in the PR? Or would it be better addressed in a separate PR?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PS, what would Uncle Bob think of this comment? ;)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left this as a reminder. It worked! haha

So, this touches on a key issue: QWC's are (supposedly) 1-indexed. All of these changes assume that downstream calculations correctly utilize 1-index-based coordinates as input by the user, i.e. the user inputs '1-5' --> include_idxs == [1, 2, 3, 4, 5]. The test cases verify that these indexes are selected correctly.

So, I think the only necessary change to get the last char in the sequence would be to append one more index onto s1inds before we slice it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think technically they should be 0-indexed (but a value of 0 will disable the filter).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per the documentation "Indexes are 0-based, meaning that the first nucleotide is position 0." Thanks @kclem. So, there is no last bp bug.

However, including 0 in the qwc throws a BadParameterException, I'm assuming because 0 disables the filter. Should we keep this behavior as is?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbowcut2 I think it should behave such that if you have -qwc 0-10_15-22,0 the first amplicon will have quantification windows 0-10 and 15-22 and then the second amplicon will have the quantification windo coordinates disabled (and therefore be set by where the sgRNA aligns).

Also, -qwc 0 will outright disable it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Colelyman okay, it works now.

tests/unit_tests/test_CRISPRessoShared.py Outdated Show resolved Hide resolved
CRISPResso2/CRISPRessoCORE.py Outdated Show resolved Hide resolved
CRISPResso2/CRISPRessoCORE.py Outdated Show resolved Hide resolved
tests/unit_tests/test_CRISPRessoCORE.py Show resolved Hide resolved
CRISPResso2/CRISPRessoCORE.py Outdated Show resolved Hide resolved
tests/unit_tests/test_CRISPRessoCORE.py Outdated Show resolved Hide resolved
Comment on lines 193 to 194
# BUG: you can't include the last base in the QW or you get an out of range error -- this is extant beyond the function tested here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think technically they should be 0-indexed (but a value of 0 will disable the filter).

mbowcut2 and others added 6 commits March 26, 2024 14:57
* Move read filtering to after merging

This is in an effort to be consistent with the behavior and results of
CRISPRessoPooled.

* Properly assign the correct file names for read filtering

* Add space around operators
* Run integration tests on pull_request

* Run pytest on pull_request

* Run pylint on pull_request
* Update report changes

* Switch branch of integration test repo

* Remove extraneous `crispresso_data_path`

* Point integration tests back to master
@Colelyman
Copy link
Collaborator Author

@mbowcut2 This looks good to me! I don't see anything else that needs to be changed, let's wait to merge it in until the other PRs are merged in first.

Thanks!

@Colelyman Colelyman merged commit de30e10 into master Mar 28, 2024
3 checks passed
@Colelyman Colelyman deleted the cole/fix-qwc branch March 28, 2024 20:03
Colelyman added a commit that referenced this pull request Mar 28, 2024
pinellolab#403)

* Mckay/pd warnings (#45)

* refactor errors='ignore' to try except

* refactored integer slice to iloc[]

* moved to_numeric try except to function

* Refactor to_numeric_ignore_errors to to_numeric_ignore_columns

This change is slightly cleaner because it addresses the root issue that some
columns are strings (and can therefore not be converted to numeric types). Now
if an error does occur when converting the dfs to numeric types it won't be
swallowed up.

* Add documentation to to_numeric_ignore_columns

---------



* Extract out quantification window coordinate function

* Refactor get_quant_window_coordinates function into two

The rationale behind this is that the behavior around the cloned amplicon is
quite different than if the qwc are specified directly for the amplicon.

* Handling qwc: add unit tests, refactor some more and add documentation

* Extract out get_relative_coordinates function

This function just computes the relative indexes without doing an alignment.

* Add clarifying unit tests for `get_relative_coordinates`

* Refactor cloned indexes to use ref_positions instead of s1inds

* fixed function for getting cloned qwc idxs

* added tests for cloned qwc function

* Introduce pandas sorting in CRISPRessoCompare (#47)

* Fix interleaved fastq input in CRISPRessoPooled and suppress CRISPRessoWGS params (#42)

* Extract out split_interleaved_fastq function to CRISPRessoShared

* Implement splitting interleaved fastq files in CRISPRessoPooled

* Suppress split_interleaved_input from CRISPRessoWGS parameters

* Suppress other parameters in CRISPRessoWGS

* Move where interleaved fastq files are split to be trimmed properly

* Bug Fix - 367 (#35)

* - Fixed references to ref_names_for_pe

* removed extra tabs

* trying to match empty line, no tabs

* - changed references to ref_names[0]

* Mckay/pd warnings (#45)

* refactor errors='ignore' to try except

* refactored integer slice to iloc[]

* moved to_numeric try except to function

* Refactor to_numeric_ignore_errors to to_numeric_ignore_columns

This change is slightly cleaner because it addresses the root issue that some
columns are strings (and can therefore not be converted to numeric types). Now
if an error does occur when converting the dfs to numeric types it won't be
swallowed up.

* Add documentation to to_numeric_ignore_columns

---------



---------



* removed if check

* implemented last test

* changed NT to BadParameterException

* changed tests, NT to BadParameter exceptions

* Uncomment and correct tests for `get_relative_coordinates`

* finished qwc tests

* 0 is an acceptable qwc

* new get_relative_coords function

* added relative coordinate tests

* removed unused functions

* formatting

* check for 0 qwc

* remove test code

* remove comment

* Move read filtering to after merging in CRISPResso (#39)

* Move read filtering to after merging

This is in an effort to be consistent with the behavior and results of
CRISPRessoPooled.

* Properly assign the correct file names for read filtering

* Add space around operators

* GitHub actions on pr (#51)

* Run integration tests on pull_request

* Run pytest on pull_request

* Run pylint on pull_request

* Run tests on PR only when opening PR (#53)

* Update reports (#52)

* Update report changes

* Switch branch of integration test repo

* Remove extraneous `crispresso_data_path`

* Point integration tests back to master

---------

Co-authored-by: mbowcut2 <[email protected]>
Co-authored-by: McKay <[email protected]>
Co-authored-by: Samuel Nichols <[email protected]>
mbowcut2 added a commit that referenced this pull request Jun 19, 2024
pinellolab#403)

* Mckay/pd warnings (#45)

* refactor errors='ignore' to try except

* refactored integer slice to iloc[]

* moved to_numeric try except to function

* Refactor to_numeric_ignore_errors to to_numeric_ignore_columns

This change is slightly cleaner because it addresses the root issue that some
columns are strings (and can therefore not be converted to numeric types). Now
if an error does occur when converting the dfs to numeric types it won't be
swallowed up.

* Add documentation to to_numeric_ignore_columns

---------



* Extract out quantification window coordinate function

* Refactor get_quant_window_coordinates function into two

The rationale behind this is that the behavior around the cloned amplicon is
quite different than if the qwc are specified directly for the amplicon.

* Handling qwc: add unit tests, refactor some more and add documentation

* Extract out get_relative_coordinates function

This function just computes the relative indexes without doing an alignment.

* Add clarifying unit tests for `get_relative_coordinates`

* Refactor cloned indexes to use ref_positions instead of s1inds

* fixed function for getting cloned qwc idxs

* added tests for cloned qwc function

* Introduce pandas sorting in CRISPRessoCompare (#47)

* Fix interleaved fastq input in CRISPRessoPooled and suppress CRISPRessoWGS params (#42)

* Extract out split_interleaved_fastq function to CRISPRessoShared

* Implement splitting interleaved fastq files in CRISPRessoPooled

* Suppress split_interleaved_input from CRISPRessoWGS parameters

* Suppress other parameters in CRISPRessoWGS

* Move where interleaved fastq files are split to be trimmed properly

* Bug Fix - 367 (#35)

* - Fixed references to ref_names_for_pe

* removed extra tabs

* trying to match empty line, no tabs

* - changed references to ref_names[0]

* Mckay/pd warnings (#45)

* refactor errors='ignore' to try except

* refactored integer slice to iloc[]

* moved to_numeric try except to function

* Refactor to_numeric_ignore_errors to to_numeric_ignore_columns

This change is slightly cleaner because it addresses the root issue that some
columns are strings (and can therefore not be converted to numeric types). Now
if an error does occur when converting the dfs to numeric types it won't be
swallowed up.

* Add documentation to to_numeric_ignore_columns

---------



---------



* removed if check

* implemented last test

* changed NT to BadParameterException

* changed tests, NT to BadParameter exceptions

* Uncomment and correct tests for `get_relative_coordinates`

* finished qwc tests

* 0 is an acceptable qwc

* new get_relative_coords function

* added relative coordinate tests

* removed unused functions

* formatting

* check for 0 qwc

* remove test code

* remove comment

* Move read filtering to after merging in CRISPResso (#39)

* Move read filtering to after merging

This is in an effort to be consistent with the behavior and results of
CRISPRessoPooled.

* Properly assign the correct file names for read filtering

* Add space around operators

* GitHub actions on pr (#51)

* Run integration tests on pull_request

* Run pytest on pull_request

* Run pylint on pull_request

* Run tests on PR only when opening PR (#53)

* Update reports (#52)

* Update report changes

* Switch branch of integration test repo

* Remove extraneous `crispresso_data_path`

* Point integration tests back to master

---------

Co-authored-by: mbowcut2 <[email protected]>
Co-authored-by: McKay <[email protected]>
Co-authored-by: Samuel Nichols <[email protected]>
mbowcut2 added a commit that referenced this pull request Nov 8, 2024
pinellolab#403)

* Mckay/pd warnings (#45)

* refactor errors='ignore' to try except

* refactored integer slice to iloc[]

* moved to_numeric try except to function

* Refactor to_numeric_ignore_errors to to_numeric_ignore_columns

This change is slightly cleaner because it addresses the root issue that some
columns are strings (and can therefore not be converted to numeric types). Now
if an error does occur when converting the dfs to numeric types it won't be
swallowed up.

* Add documentation to to_numeric_ignore_columns

---------

* Extract out quantification window coordinate function

* Refactor get_quant_window_coordinates function into two

The rationale behind this is that the behavior around the cloned amplicon is
quite different than if the qwc are specified directly for the amplicon.

* Handling qwc: add unit tests, refactor some more and add documentation

* Extract out get_relative_coordinates function

This function just computes the relative indexes without doing an alignment.

* Add clarifying unit tests for `get_relative_coordinates`

* Refactor cloned indexes to use ref_positions instead of s1inds

* fixed function for getting cloned qwc idxs

* added tests for cloned qwc function

* Introduce pandas sorting in CRISPRessoCompare (#47)

* Fix interleaved fastq input in CRISPRessoPooled and suppress CRISPRessoWGS params (#42)

* Extract out split_interleaved_fastq function to CRISPRessoShared

* Implement splitting interleaved fastq files in CRISPRessoPooled

* Suppress split_interleaved_input from CRISPRessoWGS parameters

* Suppress other parameters in CRISPRessoWGS

* Move where interleaved fastq files are split to be trimmed properly

* Bug Fix - 367 (#35)

* - Fixed references to ref_names_for_pe

* removed extra tabs

* trying to match empty line, no tabs

* - changed references to ref_names[0]

* Mckay/pd warnings (#45)

* refactor errors='ignore' to try except

* refactored integer slice to iloc[]

* moved to_numeric try except to function

* Refactor to_numeric_ignore_errors to to_numeric_ignore_columns

This change is slightly cleaner because it addresses the root issue that some
columns are strings (and can therefore not be converted to numeric types). Now
if an error does occur when converting the dfs to numeric types it won't be
swallowed up.

* Add documentation to to_numeric_ignore_columns

---------

---------

* removed if check

* implemented last test

* changed NT to BadParameterException

* changed tests, NT to BadParameter exceptions

* Uncomment and correct tests for `get_relative_coordinates`

* finished qwc tests

* 0 is an acceptable qwc

* new get_relative_coords function

* added relative coordinate tests

* removed unused functions

* formatting

* check for 0 qwc

* remove test code

* remove comment

* Move read filtering to after merging in CRISPResso (#39)

* Move read filtering to after merging

This is in an effort to be consistent with the behavior and results of
CRISPRessoPooled.

* Properly assign the correct file names for read filtering

* Add space around operators

* GitHub actions on pr (#51)

* Run integration tests on pull_request

* Run pytest on pull_request

* Run pylint on pull_request

* Run tests on PR only when opening PR (#53)

* Update reports (#52)

* Update report changes

* Switch branch of integration test repo

* Remove extraneous `crispresso_data_path`

* Point integration tests back to master

---------

Co-authored-by: mbowcut2 <[email protected]>
Co-authored-by: McKay <[email protected]>
Co-authored-by: Samuel Nichols <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants