Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combine identical UAS sequences #19

Merged
merged 7 commits into from
May 22, 2020
Merged

Combine identical UAS sequences #19

merged 7 commits into from
May 22, 2020

Conversation

rnmitchell
Copy link
Contributor

This MR will sum the number of reads for identical UAS region sequences for the final annotation table. This is necessary for use in downstream analysis programs when beginning with sequences that include 5' and 3' flanking sequences around the UAS region. The output will contain the same information as the current annotation table.

While the combine feature will be the default, we will also add a --fullseq flag option to the CLI in order to give the user the ability to not combine identical sequences.

@rnmitchell
Copy link
Contributor Author

rnmitchell commented May 15, 2020

Alright @standage I'd like your advice. When I first thought through this issue, I thought it was going to be very easy to identify duplicates. But now that I'm going back and looking at my code... of course I didn't use pandas (facepalm) and the code runs line by line (and also writing to the output file line by line). To me, the simplest (but not as pretty/efficient) fix is to read in the final output file at the end of the loop as a pandas df, find the duplicates, sum the reads and write over the existing output file. The other option is to rewrite the entire main() function to use pandas, which is definitely not as simple, but may be better in the long run. Do you have any other suggestions/thoughts?

@standage
Copy link
Member

Looking forward to our discussion so I can have a better background on everything, but with respect to your question about the main method: I would lean toward refactoring your main method to use Pandas. It will hopefully result in cleaner code, and it will be more maintainable (it's a common widely used library). Happy to contribute there as I can.

@rnmitchell
Copy link
Contributor Author

Yeah, I figured that would be best long term. One thing is that I've read pandas isn't great for iterating through rows. I've used apply before... but not to the large extent that would be required with this script. Any suggestions? (also can discuss after the call tomorrow.)

@standage
Copy link
Member

Building data frames row by row is a pain--we can discuss a couple different strategies there. But iterating through existing rows in a dataframe is simple and something I do routinely. Pandas supports a few different ways of editing existing columns or creating new columns from existing ones, as well as apply-style functional programming methods for rows and columns. I often end up using a mix of procedural (for loops) and functional (apply) approaches when using Pandas.

@rnmitchell
Copy link
Contributor Author

rnmitchell commented May 21, 2020

Ok, this wasn't as painful as I thought it was going to be. I switched everything to a pandas data frame and added an optional --nocombine flag in which a table is outputted without the reads combined for identical UAS region sequences, otherwise the table combines the reads of identical sequences. I need to make some tests for this.

@rnmitchell
Copy link
Contributor Author

Ok, so I added a test which tests for the number of lines in a test file (which contains identical sequences) to test the combine feature. Thoughts on a better way to test this?

@rnmitchell
Copy link
Contributor Author

This is ready for review, @standage. No rush... whenever you have some time!

@rnmitchell rnmitchell requested a review from standage May 22, 2020 14:23
@standage standage marked this pull request as ready for review May 22, 2020 17:02
@standage standage changed the title WIP: combine identical UAS sequences Combine identical UAS sequences May 22, 2020
Copy link
Member

@standage standage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it looks pretty good. I've made a few suggestions for clarity, but it's mostly just nitpicking to be honest. Keep what you like, and ignore the rest.

lusSTR/annot.py Show resolved Hide resolved
lusSTR/annot.py Outdated Show resolved Hide resolved
lusSTR/annot.py Outdated Show resolved Hide resolved
lusSTR/cli.py Outdated Show resolved Hide resolved
lusSTR/tests/test_suite.py Outdated Show resolved Hide resolved
@rnmitchell
Copy link
Contributor Author

Ok, so I made the changes... but I don't know how to remove the 'Changes requested' tag.

@standage
Copy link
Member

Only the reviewer that requests changes can approve those changes. Done!

@standage standage merged commit 3b28417 into master May 22, 2020
@standage standage deleted the combine branch May 22, 2020 18:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants