Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add Path options for SAMFileWriterFactory #840

Merged

Conversation

jean-philippe-martin
Copy link
Contributor

Description

Add overloads to SAMFileWriterFactory that take Path as input instead of File. The old version that took File now just redirect to their Path equivalent, so there's the same amount of code to test.

The motivation for this change is to support Java NIO's pluggable file systems. This change (in combination with Google's NIO provider) allows one to directly write a SAM, BAM, or CRAM file to a Google Cloud Storage bucket. See eg. GATK's matching issue.

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary (not necessary)
  • Is backward compatible (does not break binary or source compatibility)

@codecov-io
Copy link

codecov-io commented Apr 4, 2017

Codecov Report

Merging #840 into master will decrease coverage by 0.037%.
The diff coverage is 56.757%.

@@               Coverage Diff               @@
##              master      #840       +/-   ##
===============================================
- Coverage     66.003%   65.966%   -0.037%     
- Complexity      7460      7470       +10     
===============================================
  Files            529       529               
  Lines          31962     31992       +30     
  Branches        5463      5464        +1     
===============================================
+ Hits           21096     21104        +8     
- Misses          8724      8742       +18     
- Partials        2142      2146        +4
Impacted Files Coverage Δ Complexity Δ
.../samtools/StreamInflatingIndexingOutputStream.java 0% <0%> (ø) 0 <0> (ø) ⬇️
src/main/java/htsjdk/samtools/BAMIndexer.java 76.768% <0%> (-4.953%) 17 <0> (ø)
src/main/java/htsjdk/samtools/util/IOUtil.java 31.715% <100%> (+0.445%) 65 <4> (+4) ⬆️
...rc/main/java/htsjdk/samtools/util/BinaryCodec.java 67.89% <33.333%> (-2.671%) 55 <1> (ø)
...ain/java/htsjdk/samtools/BinaryBAMIndexWriter.java 69.663% <66.667%> (-0.452%) 14 <1> (+1)
src/main/java/htsjdk/samtools/BAMFileWriter.java 54.167% <66.667%> (+1.419%) 14 <1> (-1) ⬇️
...ain/java/htsjdk/samtools/SAMFileWriterFactory.java 62.759% <72.973%> (-1.212%) 47 <25> (+6)
src/main/java/htsjdk/samtools/SAMTextWriter.java 69.318% <0%> (-6.818%) 12% <0%> (ø)
... and 3 more

Copy link
Member

@magicDGS magicDGS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very useful for me! In addition, I would appreciate if:

  • Code paths related with filenames use toURI().toString() instead of toString(). In my case, I find it more useful for debugging that the returned string representtion.
  • When I was working in other parts of the code to support Path, I found quite useful the Jmfs to test non-file paths. It will be cool to have that for SAM/BAM/CRAM writing too!

}

// write the BAM records to a temporary CRAM
final File tempCRAMFile = File.createTempFile("testBAMThroughCRAMRoundTrip", CramIO.CRAM_FILE_EXTENSION);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe for the Path tests will be better to use the Jimfs (in-memory filesystem).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent idea! I see we already depend on jimfs.

I'm also adding it to the test of IOUtil.addExtension, so we can test non-default-filesystem paths.

I also changed testCRAMThroughBAMRoundTrip so it saves to jimfs instead of the filesystem, that should speed things up a little bit.

This leaves only one test using files. I'm thinking of leaving it so we exercise that path through the code, though we can remove it if you feel strongly about it (these methods just call .toPath() and then redirect to the path code anyways).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion was related to use at least in the changed test the jims filesystem, but anyway I think that some of the tests should still have File, which is still quite common.

final File referenceFasta) {
OutputStream cramOS = null;
OutputStream indexOS = null ;

if (createIndex) {
if (!IOUtil.isRegularPath(outputFile)) {
log.warn("Cannot create index for CRAM because output file is not a regular file: " + outputFile.getAbsolutePath());
log.warn("Cannot create index for CRAM because output file is not a regular file: " + outputFile);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would not be better to use outputFile.toURI() for easier debugging?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, done!

}
catch (final IOException ioe) {
throw new RuntimeIOException("Error creating index file for: " + outputFile.getAbsolutePath()+ BAMIndex.BAMIndexSuffix);
throw new RuntimeIOException("Error creating index file for: " + indexPath);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the URI will be useful also here...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, done.

}
catch (final IOException ioe) {
throw new RuntimeIOException("Error creating CRAM file: " + outputFile.getAbsolutePath());
throw new RuntimeIOException("Error creating CRAM file: " + outputFile.toString());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the URI will be useful also here...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed!

@@ -303,7 +342,7 @@ public SAMFileWriter makeSAMWriter(final SAMFileHeader header, final boolean pre
if (this.useAsyncIo) return new AsyncSAMFileWriter(ret, this.asyncOutputBufferSize);
else return ret;
} catch (final IOException ioe) {
throw new RuntimeIOException("Error opening file: " + outputFile.getAbsolutePath());
throw new RuntimeIOException("Error opening file: " + outputPath);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

URI?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You got it!

final boolean createIndex = this.createIndex && IOUtil.isRegularPath(outputFile);
OutputStream os = IOUtil.maybeBufferOutputStream(Files.newOutputStream(outputPath), bufferSize);
if (createMd5File) os = new Md5CalculatingOutputStream(os, IOUtil.addExtension(outputPath,".md5"));
final BAMFileWriter ret = new BAMFileWriter(os, outputPath.toString(), compressionLevel, deflaterFactory);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

URI string better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Done!

@jean-philippe-martin
Copy link
Contributor Author

Thank you, excellent suggestions! I'm glad this code will be useful to you. Let me know what you think of this revised version!

@jean-philippe-martin
Copy link
Contributor Author

I see that storing the file name as a URL broke the tests; but that's a good thing because it looks like this uncovered usage that wouldn't have worked when using other filesystems. I've added a test for SAMFileWriterFactory that uses jimfs, and updated the code to be able to handle Path objects in a few more places to make this work.

@@ -26,7 +27,7 @@ public StreamInflatingIndexingOutputStream(final OutputStream s1, final File ind
this.s1 = s1;
this.s2 = new PipedOutputStream();
final PipedInputStream pin = new PipedInputStream(this.s2, Defaults.NON_ZERO_BUFFER_SIZE);
this.thread = new Thread(new Indexer(indexFile, pin), "BamIndexingThread");
this.thread = new Thread(new Indexer(indexFile.toPath(), pin), "BamIndexingThread");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it be possible to add a constructor for Path also here, and delegate the Fileone to that one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

Assert.assertEquals(IOUtil.addExtension(p, ".ext"), IOUtil.getPath("/folder/file.ext"));
p = IOUtil.getPath("folder/file");
Assert.assertEquals(IOUtil.addExtension(p, ".ext"), IOUtil.getPath("folder/file.ext"));
FileSystem jimfs = Jimfs.newFileSystem(Configuration.unix());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The jmfs should be closed. I guess that try-with-resources is the better option for that...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@magicDGS
Copy link
Member

magicDGS commented Apr 5, 2017

The Path support in this PR is great, thanks for doing this @jean-philippe-martin! For me this is perfect, but let's see the suggestions from the Broad team.

I added two more comments that I found while reviewing the latest changes, one related with closing jimfs and the other regarding another ctor for Path. Thank you for taking into account my comments!

@jean-philippe-martin
Copy link
Contributor Author

You're welcome @magicDGS, thank you for your helpful comments!

@jean-philippe-martin
Copy link
Contributor Author

If I understand correctly we're waiting for a thumbs-up from someone at Broad. @droazen or @lbergelson perhaps?

@droazen
Copy link
Contributor

droazen commented Apr 6, 2017

I'll have a look later this week.

@droazen droazen self-assigned this Apr 6, 2017
@droazen droazen self-requested a review April 6, 2017 00:24
@jean-philippe-martin
Copy link
Contributor Author

@droazen gentle ping. This is needed for NIO writes.

@jean-philippe-martin
Copy link
Contributor Author

@droazen weekly ping. We need this to be able to make progress.

@jean-philippe-martin
Copy link
Contributor Author

@droazen ? @lbergelson ? @tfenne ? Is there anyone here? Hello? I can't add good cloud support support to GATK if every commit takes 25 days.

@droazen
Copy link
Contributor

droazen commented Apr 28, 2017

@jean-philippe-martin Sorry for the delay on this review -- we've been dealing with some unexpected and urgent internal deadlines lately, as I discussed at our last group meeting, and have not had much time for htsjdk reviews. Have to ask you to be patient a bit longer, but we'll get to this.

@yfarjoun
Copy link
Contributor

this needs a review....

@jean-philippe-martin
Copy link
Contributor Author

@droazen - customary reminder that this code is waiting for review. I see conflicts are starting to creep in.

@jean-philippe-martin
Copy link
Contributor Author

@lbergelson if you start this week we may get a few rounds of review before I leave for my vacation.

@jean-philippe-martin
Copy link
Contributor Author

@droazen @lbergelson Customary reminder that this code is still waiting for review.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jean-philippe-martin Finally, after much delay.... a review!

Looks pretty good overall. A few nitpicks and requests for more tests.

I do have one serious worry about introducing new NPEs with the new method overloads. I think those should be made to pass through nulls, or we should add tests that show that they can't accept nulls in the first place. We definitely introduced bugs during earlier path expansions because we didn't think about nulls as a viable input for a file.

@@ -75,7 +75,13 @@ protected BAMFileWriter(final OutputStream os, final File file, final int compre
outputBinaryCodec.setOutputFileName(getPathString(file));
}

private void prepareToWriteAlignments() {
protected BAMFileWriter(final OutputStream os, final String aboluteFilename, final int compressionLevel, final DeflaterFactory deflaterFactory) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo, aboluteFilename -> absoluteFilename

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! Fixed.

if (indexFile.exists()) {
if (!indexFile.canWrite()) {
throw new SAMException("Not creating BAM index since unable to write index file " + indexFile);
final String indexFileBase = pathURI.toString().endsWith(BamFileIoUtils.BAM_FILE_EXTENSION) ?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a relic of a previous refactoring, pathUri is already a String.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed toString()

@@ -59,6 +60,24 @@ public BinaryBAMIndexWriter(final int nRef, final File output) {
}

/**
* constructor
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to write in constructor

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the word "constructor" (I think that's what you mean?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is what I mean

* @param nRef Number of reference sequences
* @param output BAM Index output file
*/
public BinaryBAMIndexWriter(final int nRef, final Path output) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the existing file constructor delegate to this one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, done!

* @param fileHeader header for the corresponding bam file
*/
public BAMIndexer(final File output, final SAMFileHeader fileHeader) {
this(output.toPath(), fileHeader);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes me uncomfortable that these can all throw NullPointerException now. We've been bitten by newly introduced NPR's in other places we've done this conversion.

In most cases it will probably NPR later if a null is passed in, but HTSJDK isn't very consistent about how nulls are handled and null is sometimes a valid value in surprising places.

I think we'd be safer if we wrapped all of these in a ternary expression. Maybe we should add IOUtil.toPath(File) that returns a path or null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Changing it here.

final SAMFileHeader samHeader) {

// NOTE: even when the input is coord-sorted, using assumePresorted=false will cause some
// tests to fail since it can change the order of some unmapped reads - AFAICT this is allowed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right about the ordering issue I believe.

// index only created if coordinate sorted
header.setSortOrder(SAMFileHeader.SortOrder.coordinate);
header.addSequence(new SAMSequenceRecord("chr1", 123));
final SAMFileWriter writer = factory.makeBAMWriter(header, false, outputPath);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like it should be a try() with resources

@@ -150,6 +168,18 @@ private void createSmallBam(final File outputFile) {
writer.close();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should probably be a try with resources

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed here and below.

if (verifySupplementalFiles) {
final Path index = SamFiles.findIndex(output);
final Path md5File = IOUtil.addExtension(output, ".md5");
Assert.assertTrue(Files.size(index) > 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe extract this out as a helper method since it's duplicated. Although there's a ton of duplicated code in these test files anyway...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The text is there twice, but one of them uses a File and other other uses a Path. Not much there to deduplicate.

@@ -170,6 +175,23 @@ public void testFileTypeUnix(final String path, boolean expectedIsRegularFile) {
Assert.assertEquals(IOUtil.isRegularPath(file), expectedIsRegularFile);
}

@Test
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need a unit test for isRegularPath(path)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

@jean-philippe-martin
Copy link
Contributor Author

jean-philippe-martin commented Aug 23, 2017

Addressed change requests. For some reason I can't compile this in intellij anymore; ./gradlew test running as we speak.

@jean-philippe-martin
Copy link
Contributor Author

$ ./gradlew test
Results: SUCCESS (24183 tests, 24183 successes, 0 failures, 0 skipped)
BUILD SUCCESSFUL

@jean-philippe-martin
Copy link
Contributor Author

@lbergelson it's been a week. I'm going to start rebasing to the new master and resolving conflicts.

Also:
- use jimfs for most CRAM compliance tests
- handle codec filenames starting with "file://"
- add jimfs test for bam/cram writer
- protect against null path
@jean-philippe-martin
Copy link
Contributor Author

jean-philippe-martin commented Aug 31, 2017

Done. Pushing the new code.

$ ./gradlew test
Tests: succeeded 24455, failed 0, canceled 0, ignored 63, pending 0
All tests passed.

@lbergelson
Copy link
Member

@jean-philippe-martin It looks like there was an issue with the rebase:

htsjdk/src/main/java/htsjdk/samtools/SAMFileWriterFactory.java:348: error: ')' expected
                          IOUtil.addExtension(outputPath, ".md5"), samFlagFieldOutput)

Good to merge once that's resolved and test pass!

@jean-philippe-martin
Copy link
Contributor Author

Oh indeed, I forgot to add that fix to the commit. Good on Travis! Fixed now. I don't have a "merge" button here (on GitHub) so @lbergelson I'll let you do the merging once Travis comes up green (unless you want me to do that on the command line? I seem to remember some prohibition against that).

@lbergelson lbergelson merged commit 1ade6c9 into samtools:master Sep 1, 2017
@lbergelson
Copy link
Member

@jean-philippe-martin Your long nightmare is finally over!

@jean-philippe-martin
Copy link
Contributor Author

Champagne! Thank you Louis! We can celebrate the submission of my longest-ever pull request. From April 3 to September 1st, so two days short of 5 months long!

🎉

@jean-philippe-martin jean-philippe-martin deleted the jp_path_for_writers branch September 1, 2017 17:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants