-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Combine identical UAS sequences #19
Conversation
Alright @standage I'd like your advice. When I first thought through this issue, I thought it was going to be very easy to identify duplicates. But now that I'm going back and looking at my code... of course I didn't use pandas (facepalm) and the code runs line by line (and also writing to the output file line by line). To me, the simplest (but not as pretty/efficient) fix is to read in the final output file at the end of the loop as a pandas df, find the duplicates, sum the reads and write over the existing output file. The other option is to rewrite the entire main() function to use pandas, which is definitely not as simple, but may be better in the long run. Do you have any other suggestions/thoughts? |
Looking forward to our discussion so I can have a better background on everything, but with respect to your question about the main method: I would lean toward refactoring your main method to use Pandas. It will hopefully result in cleaner code, and it will be more maintainable (it's a common widely used library). Happy to contribute there as I can. |
Yeah, I figured that would be best long term. One thing is that I've read pandas isn't great for iterating through rows. I've used apply before... but not to the large extent that would be required with this script. Any suggestions? (also can discuss after the call tomorrow.) |
Building data frames row by row is a pain--we can discuss a couple different strategies there. But iterating through existing rows in a dataframe is simple and something I do routinely. Pandas supports a few different ways of editing existing columns or creating new columns from existing ones, as well as apply-style functional programming methods for rows and columns. I often end up using a mix of procedural (for loops) and functional (apply) approaches when using Pandas. |
Ok, this wasn't as painful as I thought it was going to be. I switched everything to a pandas data frame and added an optional |
Ok, so I added a test which tests for the number of lines in a test file (which contains identical sequences) to test the combine feature. Thoughts on a better way to test this? |
This is ready for review, @standage. No rush... whenever you have some time! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall it looks pretty good. I've made a few suggestions for clarity, but it's mostly just nitpicking to be honest. Keep what you like, and ignore the rest.
Ok, so I made the changes... but I don't know how to remove the 'Changes requested' tag. |
Only the reviewer that requests changes can approve those changes. Done! |
This MR will sum the number of reads for identical UAS region sequences for the final annotation table. This is necessary for use in downstream analysis programs when beginning with sequences that include 5' and 3' flanking sequences around the UAS region. The output will contain the same information as the current annotation table.
While the combine feature will be the default, we will also add a
--fullseq
flag option to the CLI in order to give the user the ability to not combine identical sequences.