-
Notifications
You must be signed in to change notification settings - Fork 590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature support: stripped-down replacement for the old GATK ROD system #224
Conversation
yay! On Mon, Feb 23, 2015 at 6:02 PM, droazen [email protected] wrote:
|
There are currently a few known issues that reviewers should bear in mind (these will become tickets shortly): -All codecs marked as implementing -The - - -All codecs should move to tribble, and must implement - |
Coverage decreased (-2.67%) to 74.54% when pulling d006568b0244cc85ea9092d0b96d2b7d29230bd5 on dr_feature_support into f93c4b6 on master. |
d006568
to
d3acbf0
Compare
Coverage decreased (-2.64%) to 74.58% when pulling d3acbf0f56c789f9ac98239e6f85001afbc65720 on dr_feature_support into f93c4b6 on master. |
d3acbf0
to
39b2902
Compare
Set<Class<?>> concreteClassSet = new HashSet<>(); | ||
|
||
for ( Class<?> clazz : classes ) { | ||
if ( ! Modifier.isAbstract(clazz.getModifiers()) && ! Modifier.isInterface(clazz.getModifiers()) ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extract a static method isConcrete(Class c) and use it here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do
Coverage decreased (-2.64%) to 74.58% when pulling 39b2902dae243d7dea480313ac1ab4b6d7bb15d9 on dr_feature_support into f93c4b6 on master. |
// If the Collection's parameterized type is itself parameterized (eg., List<Foo<Bar>>), | ||
// return the raw type of the outer parameter (Foo.class, in this example) to avoid a | ||
// ClassCastException. Otherwise, return the Collection's type parameter directly as a Class. | ||
return (Class<?>) (genericTypes[0] instanceof ParameterizedType ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a comment why the 0-th items in genericTypes is special
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not special -- it's simply the first (and only) type parameter (notice that a few lines earlier we've checked that the Collection has only a single type parameter). Will clarify in the comments.
39b2902
to
11667a7
Compare
* Gets all Features from the sources represented by the provided FeatureInput arguments that overlap | ||
* this FeatureContext's query interval. | ||
* | ||
* Returned Features are not guaranteed to be in any particular order. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they may also not be unique
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will add a note to that effect
Coverage decreased (-2.64%) to 74.58% when pulling 11667a7fb7f7273d77bc25c34aa12a4862f71706 on dr_feature_support into f93c4b6 on master. |
public String get(String columnName) { | ||
int position = keys.indexOf(columnName); | ||
if (position < 0) throw new IllegalArgumentException("We don't have a column named " + columnName); | ||
return values.get(position); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should delegate to getValue. and the two methods should have unified names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am hesitant to make changes to codecs that are not currently covered by tests. This kind of refactoring should wait until TableCodecUnitTest exists.
* @return true if all records overlapping the provided interval are already contained in our cache, otherwise false | ||
*/ | ||
public boolean cacheHit( final GenomeLoc interval ) { | ||
return cachedContig != null && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a method on GenomeLoc that implements this functionality?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't use GenomeLoc here due to the sequence dictionary requirement (we can't assume the presence of a sequence dictionary in this part of the engine).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a todo or a note in #61 and we can fix it when we pull the overlap methods up out of GenomeLoc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you mean #100, but will do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that wast the one I meant..
/** | ||
* True if the file backing this data source has an accompanying index file, false if it doesn't | ||
*/ | ||
private boolean hasIndex; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be made final. Maybe other fields too. It's always a good idea to make things final if possible. Makes reading code simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do
i like it. i want to get rid of many of those codecs though, let's make sure that only the useful ones survive |
* Gets all Features from the source represented by the provided FeatureInput argument that overlap | ||
* this FeatureContext's query interval. | ||
* | ||
* Returned Features are not guaranteed to be in any particular order. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then it should probably return a set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Features are not guaranteed to be unique, so a set would be inappropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MultiSet then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would require a 3rd-party collections framework plus updating all client code that we port. Not worth it.
1f25f62
to
5e7efc1
Compare
Coverage decreased (-2.62%) to 74.61% when pulling 5e7efc1c114c403a8157878dd1874016540902ae on dr_feature_support into 993ba00 on master. |
All comments addressed -- squashing and merging now. |
Replicates most of the functionality of the old ROD system in ~5% of the code. The incomprehensible tangle of nested iterators, bindings, views, states, tracks, trackers, builders etc., etc., is gone, replaced by about 4 core classes: FeatureContext, FeatureDataSource, FeatureInput, and FeatureManager. FeatureContext: This is tool-facing interface (replaces RefMetaDataTracker). Allows particular sources of Features to be queried. FeatureDataSource: Handles the low-level details of querying a source of Features. Uses a caching scheme optimized for the use case of queries over intervals with gradually increasing start/stop positions. FeatureInput: This is used to declare Feature arguments in tools (replaces RodBinding). The engine discovers all FeatureInput arguments declared in the tool's class hierarchy, and initializes data sources for each one that was specified on the command line. FeatureManager: Manages the pool of data sources, as well as codec and file format discovery and type checking. -ReadWalker interface has changed: apply() now takes a FeatureContext argument (will be null if there are no sources of Features). -Included an example tool PrintReadsWithVariants to demonstrate use of the new ReadWalker interface. -Since Feature files must be indexed in order to query them, I have provided a tool IndexFeatureFile that can index any Feature-containing file. -Made required changes to the argument-parsing system. Feature argument discovery is as de-coupled as possible from the main arg parser. -Made required changes to BQSR, and eliminated the temporary HACKRefMetaDataTracker. -Comprehensive tests
5e7efc1
to
f7c1911
Compare
Feature support: stripped-down replacement for the old GATK ROD system
* add bash code to chunk files at 15tb limit * update dockstore.yml * fix yml formatting * fill in importLoadTable inputs json * take out variable in comment * remove unnecessary files in task * printf for scientific notation * scientific notation * increase memory to check if memory issue * remove gsutil copy log * fix gsutil cp destination directory * decrease memory after testing for increased memory fails * testing bq job id * test sed pattern matching * take out mv command for testing purposes * update move to done files command to within each set * troubleshoot bq load command * update project for inputs json * revert parsing of job id * test * write load status to separate file * sed file to get job ID * test2 * fix bq wait status * separate wait status command into pieces * reorganize output block * move wait command into main loop * test taking out wait command entirely * add task jsut for bq wait testing * set project for bq wait * put everything back now that wait works with project * re commit with updated comments and cleaned up echo commands * take out call to test task * mv files from original dir to set dirs instead of copy * trouble shoot wait_status * change back to cp for testing in copying sets step * take out output file for bq wait * add back input file to wait statement * add in more tmp files in output * try to capture success in bq wait * reorg output file for while loop * fix sed command * clean up for final test - mv files from original to sets * update full import genomes wdl * take out un-used inputs in load table task * put back in the variables that control calls in order * fixing lockfile exists comment * add quotes to escape backtick in echo * take out hard coded pet table in gsutil du * single quote echo to block expression expansion in lockfile * multi line echo comment for lockfile * fix missing quotes * set dockstore yml back to what it should be for ah_var_store branch * remove testing wdl and json files * set single file with all sum and sets * fix sed variable format * testing pattern matching * fix reading wrong input file in loop * adding output files * test * fix quotes on awk statement * awk * clean up comments and print statements * edit comments
* add bash code to chunk files at 15tb limit * update dockstore.yml * fix yml formatting * fill in importLoadTable inputs json * take out variable in comment * remove unnecessary files in task * printf for scientific notation * scientific notation * increase memory to check if memory issue * remove gsutil copy log * fix gsutil cp destination directory * decrease memory after testing for increased memory fails * testing bq job id * test sed pattern matching * take out mv command for testing purposes * update move to done files command to within each set * troubleshoot bq load command * update project for inputs json * revert parsing of job id * test * write load status to separate file * sed file to get job ID * test2 * fix bq wait status * separate wait status command into pieces * reorganize output block * move wait command into main loop * test taking out wait command entirely * add task jsut for bq wait testing * set project for bq wait * put everything back now that wait works with project * re commit with updated comments and cleaned up echo commands * take out call to test task * mv files from original dir to set dirs instead of copy * trouble shoot wait_status * change back to cp for testing in copying sets step * take out output file for bq wait * add back input file to wait statement * add in more tmp files in output * try to capture success in bq wait * reorg output file for while loop * fix sed command * clean up for final test - mv files from original to sets * update full import genomes wdl * take out un-used inputs in load table task * put back in the variables that control calls in order * fixing lockfile exists comment * add quotes to escape backtick in echo * take out hard coded pet table in gsutil du * single quote echo to block expression expansion in lockfile * multi line echo comment for lockfile * fix missing quotes * set dockstore yml back to what it should be for ah_var_store branch * remove testing wdl and json files * set single file with all sum and sets * fix sed variable format * testing pattern matching * fix reading wrong input file in loop * adding output files * test * fix quotes on awk statement * awk * clean up comments and print statements * edit comments
Replicates most of the functionality of the old ROD system in ~5% of the
code. The incomprehensible tangle of nested iterators, bindings, views, states,
tracks, trackers, builders etc., etc., is gone, replaced by about 4 core classes:
FeatureContext, FeatureDataSource, FeatureInput, and FeatureManager.
FeatureContext: This is tool-facing interface (replaces RefMetaDataTracker).
Allows particular sources of Features to be queried.
FeatureDataSource: Handles the low-level details of querying a source of Features.
Uses a caching scheme optimized for the use case of queries over
intervals with gradually increasing start/stop positions.
FeatureInput: This is used to declare Feature arguments in tools (replaces RodBinding).
The engine discovers all FeatureInput arguments declared in the tool's class
hierarchy, and initializes data sources for each one that was specified
on the command line.
FeatureManager: Manages the pool of data sources, as well as codec and file format
discovery and type checking.
-ReadWalker interface has changed: apply() now takes a FeatureContext argument
(will be null if there are no sources of Features).
-Included an example tool PrintReadsWithVariants to demonstrate use of the new
ReadWalker interface.
-Since Feature files must be indexed in order to query them, I have provided a
tool IndexFeatureFile that can index any Feature-containing file.
-Made required changes to the argument-parsing system. Feature argument discovery
is as de-coupled as possible from the main arg parser.
-Made required changes to BQSR, and eliminated the temporary HACKRefMetaDataTracker.
-Comprehensive tests