-
Notifications
You must be signed in to change notification settings - Fork 369
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cloudify RevertSam #1949
Cloudify RevertSam #1949
Conversation
@LeeTL1220 @droazen here it is finally. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tsato I'm not sure what state this is in, so I started reviewing, but I assume its not ready yet (there are still "tsato" comments, some unused variables, etc.) Let me know if/when its ready for review.
@@ -257,10 +261,12 @@ protected String[] customCommandLineValidation() { | |||
|
|||
protected int doWork() { | |||
IOUtil.assertFileIsReadable(INPUT.toPath()); | |||
ValidationUtil.assertWritable(OUTPUT, OUTPUT_BY_READGROUP); | |||
if (OUTPUT != null && !PicardBucketUtils.isGcsUrl(OUTPUT)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest we add isFileUrl
to PicardHstPath, and check for that here, since this won't correctly filter paths on the other (non-GCS) cloud providers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Repeating my comment from the last round: we should be checking to see if this is a "file://" URL before proceeding, rather than filtering out just gcs, so this will work with non-gcs cloud providers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@cmnbroad all the main code should be ready for review. I've been making fixes to the test, in particular how to properly clean up the test output etc so that the output from a previous test doesn't affect the current test. It's a little more challenging when the output paths are specified in a txt file ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First round done.
@@ -111,7 +111,7 @@ public Path toPath() { | |||
*/ | |||
public static PicardHtsPath fromPath(final Path path){ | |||
Objects.requireNonNull(path); | |||
return new PicardHtsPath(new HtsPath(path.toUri().toString())); | |||
return new PicardHtsPath(path.toUri().toString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, good catch.
outputMap.put("rg2", new File("/foo/bar/rg2.bam")); | ||
final Map<String, Path> outputMap = new HashMap<>(); | ||
outputMap.put("rg1", new File("/foo/bar/rg1.bam").toPath()); | ||
outputMap.put("rg2", new File("/foo/bar/rg2.bam").toPath()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here and throughout the file, instead of making all of these throw-away File
objects, its much cleaner to just do:
outputMap.put("rg2", new File("/foo/bar/rg2.bam").toPath()); | |
outputMap.put("rg2", Paths.get("/foo/bar/rg2.bam")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, done
@@ -592,4 +597,96 @@ public void testHardClipRoundTrip() throws Exception { | |||
} | |||
} | |||
} | |||
|
|||
public static final String CURRENT_DIRECTORY = System.getProperty("user.dir"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This variable is unused and can be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -592,4 +597,96 @@ public void testHardClipRoundTrip() throws Exception { | |||
} | |||
} | |||
} | |||
|
|||
public static final String CURRENT_DIRECTORY = System.getProperty("user.dir"); | |||
public static final String REVERT_SAM_TEST_DATA_DIR = "testdata/picard/sam/RevertSam/"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only used in one place. But it should be at the top of the class, and then it could be used everywhere the string testdata/picard/sam/RevertSam/
appears above, of which there are quite a few.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
public static final String CURRENT_DIRECTORY = System.getProperty("user.dir"); | ||
public static final String REVERT_SAM_TEST_DATA_DIR = "testdata/picard/sam/RevertSam/"; | ||
|
||
final String testSmall = "gs://hellbender/test/resources/picard/bam/CEUTrio.HiSeq.WGS.b37.NA12878.20.21_n100.bam"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This variable is unused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
|
||
final String testSmall = "gs://hellbender/test/resources/picard/bam/CEUTrio.HiSeq.WGS.b37.NA12878.20.21_n100.bam"; | ||
final String testMedium = "gs://hellbender/test/resources/picard/bam/CEUTrio.HiSeq.WGS.b37.NA12878.20.21_n10000.bam"; | ||
final String testMediumCram = "gs://hellbender/test/resources/picard/bam/CEUTrio.HiSeq.WGS.b37.NA12878.20.21_n10000.cram"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here and all over the place, since you've made these constants (they are final), they should be static, and have names that are all caps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if all of these file name constants were PicardHtsPaths instead of Strings, I think the downstream test code would be a lot cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, moved these up to PicardCommandLineProgramTest
|
||
// The read groups for these readGroupMaps comes from gs://hellbender/test/resources/picard/bam/CEUTrio.HiSeq.WGS.b37.NA12878.20.21_n10000.bam | ||
final String testReadGroupMapFile = "gs://hellbender/test/resources/picard/revertSam/test_group_map_file.txt"; | ||
final String testReadGroupMapFileLocal = REVERT_SAM_TEST_DATA_DIR + "test_group_map_file.txt"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than have the cloud one use an "unsuffixed" name (testReadGroupMapFile), give it a suffix that is symmetric with the local one , i.e. testReadGroupMapFileCloud
.
{testMedium, Optional.empty(), OUTPUT_BY_READ_GROUP, Optional.of(testReadGroupMapFile)}, | ||
{testMedium, Optional.empty(), OUTPUT_BY_READ_GROUP, Optional.of(testReadGroupMapFileLocal)}, | ||
{testMediumCram, Optional.empty(), OUTPUT_BY_READ_GROUP, Optional.of(testReadGroupMapFile)}, | ||
{testMediumCram, Optional.of(GCloudTestUtils.TEST_OUTPUT_DEFAULT + "test/reverted.cram"), !OUTPUT_BY_READ_GROUP, Optional.empty()}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think using Optional
is buying anything here -its intended to be used as a return value to signal callers that there may be no value, but null would be much simpler in these cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@DataProvider(name="cloudTestData") | ||
public Object[][] getCloudTestData() { | ||
return new Object[][] { | ||
{testMedium, Optional.of(GCloudTestUtils.TEST_OUTPUT_DEFAULT + "test/reverted"), OUTPUT_BY_READ_GROUP, Optional.empty()}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no guarantee that you have exclusive access to these cloud files (i.e., GCloudTestUtils.TEST_OUTPUT_DEFAULT + "test/reverted"
) when these tests run. In fact, you're pretty much guaranteed to not have exclusive access to them, since many CI test runners can be run at the same time, so this is pretty much guaranteed to result in collisions. This may very likely be the reason you're having problems deleting them in your test code below. (Even if I run theses tests locally while a CI job is running, I think bad things will happen). So we need to find a way to make then unique, probably by dynamically creating a unique output directory or something. I'm not sure if we have infrastructure for that, so we may have to invent something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, as I wrote elsewhere, this did create a race condition. See the updated code, which uses temp files with unique random names and does not recycle file names.
@cmnbroad thanks again for reviewing. Code is ready for another round of review. |
public static final String TEST_OUTPUT_DEFAULT_STR = TEST_STAGING_DEFAULT_STR + "picard/"; | ||
public static final PicardHtsPath TEST_INPUTS_DEFAULT = new PicardHtsPath(TEST_INPUTS_DEFAULT_STR); | ||
public static final PicardHtsPath TEST_STAGING_DEFAULT = new PicardHtsPath(TEST_STAGING_DEFAULT_STR); | ||
public static final PicardHtsPath TEST_PROJECT_DEFAULT = new PicardHtsPath(TEST_PROJECT_DEFAULT_STR); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this value for ? It appears to be a project name, but it gets rendered (via the unused method getTestProject
below), as a URI ? GATK doesn't use it that way. I suspect this more mistaken carryover from GATK - I dont think there should ever have been a PicardHtsPath or URI rendering of it. Unless there is some use for it, I would say it should be removed, along with getTestProject
below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be happy to remove it. @lbergelson is it ok if we removed TEST_PROJECT_DEFAULT
and getTestProject
from GCloudTestUtils.java
? Looks like you added it about a year ago.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@takutosato @lbergelson Actually, git blames me, as part of the cloudification of references PR.Either way, it looks wrong to me (having the project name might be ok, but the trailing "/" and rendering it as a URI seems totally wrong to me). If Louis confirms that, lets either fix it or remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh ok I see what you are saying now. I think it's safe to remove.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I will just remove .getURIString() and the trailing "/"
public static final String TEST_STAGING_DEFAULT_STR = "gs://hellbender-test-logs/staging/"; | ||
public static final String TEST_PROJECT_DEFAULT_STR = "broad-dsde-dev/"; | ||
public static final String TEST_OUTPUT_DEFAULT_STR = TEST_STAGING_DEFAULT_STR + "picard/"; | ||
public static final PicardHtsPath TEST_INPUTS_DEFAULT = new PicardHtsPath(TEST_INPUTS_DEFAULT_STR); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are several javadoc references in this file that are broken now - its not clear to me if they're all broken as a result of these changes, but I think some of them are (i.e, {@ value ...} references, but they should be cleaned up.
I don't think you can call a method in the course of an "{@ value....} link, instead I think 'value' needs to be a static primitive (i.e., a String). Alternatively, you can use {@ link...} to provide a navigable link to a method.
if (createOutputMap){ | ||
// If these conditions are met, we dynamically create the output map file based on the read groups in inputSAM | ||
final String outputMapDir = outputMapOptions == OutputMapOptions.CLOUD_OUTPUT_MAP ? | ||
DEFAULT_CLOUD_TEST_OUTPUT_DIR.getURIString() : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks to me like like you'll still get file name collisions if more than one PR is running this tests in CI at the same time. Let me know if I'm misunderstanding something here, but I think you need to use a dynamically generated folder name here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see; the name generation is buried below in the output map generation code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was one more case (OUTPUT_MAP = null, OUTPUT=directory, OUTPUT_BY_READ_GROUP=true) that could potentially cause collisions across PRs. I resolved it by appending a timestamp as prefix in those cases.
@@ -21,11 +23,19 @@ public abstract class CommandLineProgramTest { | |||
public static final File CHR_M_REFERENCE = new File(REFERENCE_TEST_DIR,"chrM.reference.fasta"); | |||
public static final File CHR_M_DICT = new File(REFERENCE_TEST_DIR,"chrM.reference.dict"); | |||
|
|||
// These are the hg19 references with chromosome names "1" (rather than "chr1") | |||
public static final PicardHtsPath HG19_REFERENCE_GCLOUD = new PicardHtsPath("gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.fasta"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears to be unsed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
@@ -121,6 +126,10 @@ public TabbedTextFileWithHeaderParser(final File file) { | |||
this(new TabbedInputParser(false, file)); | |||
} | |||
|
|||
public TabbedTextFileWithHeaderParser(final Path file) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would try to avoid adding a new public constructor that propagates exceptions like this (throws IOException
), especially given that the existing constructors don't. You've already written several call sites with the appropriate try/catch blocks, so it would be preferable to just create the TabbedInputParser
directly within those, and then have them delegate to the existing constructor that takes a TabbedTextFileWithHeaderParser
. Then this can be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, that makes sense, but I think elsewhere in Picard/htsjdk, we often offer constructors both for File
and Path
. So it seems less than ideal to have the implicit convention that if you are working with Path, you are expected to use the constructor that takes TabbedInputParser.
What if I put try-catch within TabbedTextFileWithHeaderParser(final Path file)
, so I can avoid the throws IOException
signature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I can't do this because the this
statement needs to be the first line of the constructor…
*/ | ||
private PicardHtsPath getTestReadGroupMapFile(final PicardHtsPath sam, final String outputBase) { | ||
final String samExtension = sam.isBam() ? ".bam" : ".cram"; | ||
final PicardHtsPath result = PicardBucketUtils.getTempFilePath(DEFAULT_CLOUD_TEST_OUTPUT_DIR.getURIString(), ".txt"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having said that, it is super confusing that this method has an outputBase arg, but the master file is always in the cloud, even when the outputBase is not. Is that intentional ? If so I would rename the method to something that indicates that it is cloud-specific. More likely though, the intention is that it should create the .txt file in the output base ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two paths here. One is the ".txt" file (really tsv), where each row is the read group, and the associated path where that read group SAM is written. This file is a temp file, always living in the cloud (by default)
The outputBase arg is the base path for each of the SAM files listed in the .txt file above. This can be cloud or local.
I renamed the variables to make this a little more clear.
* @return The PicardHtsPath to the tsv with each row a pair (read name, output), where output points to a temp file | ||
o * created in the directory specified in outputBase. | ||
*/ | ||
private PicardHtsPath getTestReadGroupMapFile(final PicardHtsPath sam, final String outputBase) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If outputBase must be a directory, I would rename it to outputDirectory
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed
|
||
@DataProvider(name = "cloudTestData") | ||
public Object[][] getCloudTestData() { | ||
final PicardHtsPath tempCramOutputPath = PicardBucketUtils.getTempFilePath(GCloudTestUtils.TEST_OUTPUT_DEFAULT_STR, "RevertSam", ".cram"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is a cloud path, I would include that in the name to make it clear, i.e. tempCloudCram or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
} | ||
|
||
if (inputSAM.isCram()){ | ||
args.add("REFERENCE_SEQUENCE=" + HG19_CHR2021); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reference to use should really be part of the data provider; otherwise its hard to add new test files to this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
*/ | ||
@Test(dataProvider = "cloudTestData", groups = "cloud") | ||
public void testCloud(final PicardHtsPath inputSAM, final PicardHtsPath outputPath, final boolean outputByReadGroup, | ||
final OutputMapOptions outputMapOptions) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took me a while to figure out why all of these args are needed. It's very possible that I'm missing some nuance, but it seems like it all of this (test and provider) code would be a lot simpler if there were just 3 values in the data provider: the input path, the output path (which will either be a file or a directory), and a writeByReadGroup
boolean (which tells you whether the output path needs to be a map (OUTPUT_MAP) or a file (OUTPUT)). Everything else in the test method could be derived from those, and then you don't need the enum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be, but I think it's a bit dangerous to rely on Files.isDirectory(OUTPUT)
given how directories are handled in GCS. And even if we did that, I don't think the code would be that much simpler. The reason why this code is hard to parse, I think, is because RevertSam accepts a wide array of input patterns (e.g. OUTPUT may be a file or directory), so I vote that we leave it as is. What do you think?
@cmnbroad sorry this took a while. I've addressed all your comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking much better - just a couple of minor things left, plus the remote file name collision issue (see inline comments and the engine team slack discussion).
public static final String TEST_PROJECT_DEFAULT_STR = "broad-dsde-dev"; | ||
public static final String TEST_OUTPUT_DEFAULT_STR = TEST_STAGING_DEFAULT_STR + "picard/"; | ||
public static final PicardHtsPath TEST_INPUTS_DEFAULT = new PicardHtsPath(TEST_INPUTS_DEFAULT_STR); | ||
public static final PicardHtsPath TEST_STAGING_DEFAULT = new PicardHtsPath(TEST_STAGING_DEFAULT_STR); | ||
public static final PicardHtsPath TEST_OUTPUT_DEFAULT = PicardHtsPath.resolve(TEST_STAGING_DEFAULT, "picard/"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a (currently unreferenced) constant TEST_OUTPUT_DEFAULT_STR
above. This TEST_OUTPUT_DEFAULT
constant should either reference that, or the TEST_OUTPUT_DEFAULT_STR
constant should be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
public static PicardHtsPath getTempFilePath(final PicardHtsPath directory, String prefix, final String extension){ | ||
return getTempFilePath(directory.getURIString(), prefix, extension); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These 2 new getTempFilePath overloads need javadoc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, but let me know if you were hoping for a more detailed javadoc
import java.io.IOException; | ||
import java.io.InputStream; | ||
import java.nio.file.Files; | ||
import java.nio.file.Path; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These imports are all unused now, and should be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -700,7 +702,7 @@ public void testUnmapBacterialContamination() throws IOException { | |||
MergeBamAlignment mergeBamAlignment = new MergeBamAlignment(); | |||
|
|||
mergeBamAlignment.ALIGNED_BAM = Collections.singletonList(file.toPath().toFile()); // TODO update to use Path when MergeBamAlignment is updated to use Path | |||
mergeBamAlignment.UNMAPPED_BAM = fileUnaligned; | |||
mergeBamAlignment.UNMAPPED_BAM = fileUnaligned.toPath().toFile(); // Also need to be updated to use Path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes in this file seem to be completely unnecessary. Although I suppose they're harmless,it seems like its. just churn since they don't seem to add anything (maybe there is some previous commit for which they were relevant ?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fileUnaligned
is now PicardHtsPath
, but MergeBamAlignment hasn't be cloudified yet. I have another PR coming soon for that.
public static final String TEST_STAGING_DEFAULT_STR = "gs://hellbender-test-logs/staging/"; | ||
public static final String TEST_PROJECT_DEFAULT_STR = "broad-dsde-dev"; | ||
public static final String TEST_OUTPUT_DEFAULT_STR = TEST_STAGING_DEFAULT_STR + "picard/"; | ||
public static final PicardHtsPath TEST_INPUTS_DEFAULT = new PicardHtsPath(TEST_INPUTS_DEFAULT_STR); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know some of these were just moved, but in some places, you used a _CLOUD
or _GCLOUD
in constant names like these, which nicely reinforces that they're remote - I'd suggest doing that for these as.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
public static final PicardHtsPath NA12878_MINI = new PicardHtsPath(GCloudTestUtils.getTestInputPath() + "picard/bam/CEUTrio.HiSeq.WGS.b37.NA12878.20.21_n100.bam"); | ||
public static final PicardHtsPath NA12878_MINI_CRAM = new PicardHtsPath(GCloudTestUtils.getTestInputPath() + "picard/bam/CEUTrio.HiSeq.WGS.b37.NA12878.20.21_n100.cram"); | ||
public static final PicardHtsPath NA12878_MEDIUM = new PicardHtsPath(GCloudTestUtils.getTestInputPath() + "picard/bam/CEUTrio.HiSeq.WGS.b37.NA12878.20.21_n10000.bam"); | ||
public static final PicardHtsPath NA12878_MEDIUM_CRAM = new PicardHtsPath(GCloudTestUtils.getTestInputPath() + "picard/bam/CEUTrio.HiSeq.WGS.b37.NA12878.20.21_n10000.cram"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know some of these names were just moved, but in some places, you used a _CLOUD
(or _GCLOUD
) in the names of constants like these, which nicely reinforces that they're remote - I'd suggest doing that for these as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
LocalDateTime now = LocalDateTime.now(); | ||
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd_HH:mm:ss.SSS"); | ||
prefix = now.format(formatter); | ||
args.add("PREFIX=" + prefix); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, I think this name collision resolution strategy is a step backwards.
Instead of using DEFAULT_CLOUD_TEST_OUTPUT_DIR for the output directory in these tests, it would be much simpler to change the test code to create a new temporary directory/bucket in the staging dir, and pass that to RevertSam as the outputPath. Then you don't need to add the new tool arg PREFIX, which is really only useful for tests, or this prefix/timestamp code (which is still theoretically subject to name collision).
If there are test cases for which this (or copying the inputs - see the discussion in the engine team space) doesn't suffice for some reason, let me know and we can think about other alternatives. But I'd very much prefer to not add args to tools that are only useful for tests if we can possibly avoid it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that would be the most natural thing to do, except that there are no directories in google cloud so we can't "create" a temp directory…what's confusing is that in the browser, they do let you create a directory (that's how I created DEFAULT_CLOUD_TEST_OUTPUT_DIR). For instance createDirectory
method in CloudStorageFileSystemProvider.java
is a no-op.
Come to think of it, as long as we don't check that the output directory exists (which we currently do not when the directory is in the cloud), I could probably just create an nonexistent random directory name as OUTPUT, and the files will be created under it. I will do some digging and let you know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tsato - you can create a cloud directory (see the engine-team slack discussion from Tuesday 7/9 about this). Also, I'm currently doing a similar thing myself in a different PR so I can share that code with you if you like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know if my approach in the new commit is what you had in mind.
final PicardHtsPath tempLocalCram = PicardBucketUtils.getLocalTempFilePath("RevertSam", ".cram"); | ||
return new Object[][]{ | ||
// Output by read group without the output map, write output bams in the cloud. | ||
{NA12878_MEDIUM, DEFAULT_CLOUD_TEST_OUTPUT_DIR, OUTPUT_BY_READ_GROUP, OutputMapOptions.NO_OUTPUT_MAP, null }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this test case, I think you can just create a temporary parent bucket in the staging area to to hold the output, rather than referencing the default staging bucket here and trying to synthesize unique names using a prefix and timestamp (see my comment where the name synthesis code is).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I addressed this issue by "creating" a "directory" with a randomly generated path. See PicardBucketUtils::getRandomGCSDirectory()
@cmnbroad I addressed your comments, removed the "prefix" with a timestamp approach for avoiding race conditions, and I also removed the enum you suspected earlier was unnecessary. Back to you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah - this always takes longer than I think it will. I did a quick pass and there are only a couple of very minor things left. Also, we need to discuss the auto-delete configuration for the temporary directories with Louis. I think we might need to enforce that getRandomGCSDirectory
always creates a directory in a well-known specific gcs object that is set to auto-delete, rather than letting the caller create one anywhere. I'll bring this up for discussion with Louis on Monday at the team meeting, so lets hold off on merging until then.
/** | ||
* Creates a path to a "directory" on a Google Cloud System filesystem with a randomly generated URI. | ||
* Since the notion of directories does not exist in GCS, it creates a path to a URI ending in "/". | ||
* Calling this method will not create a directory/file on GCS. It merely returns a path to a non-directory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is super-confusing:
It merely returns a path to a non-directory.
since this method is supposed to be creating a directory. I would suggest maybe saying "it creates an object that can be used as a parent container for other objects" or something similar instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I rewrote this to make it clear
// For local temp file, directory should be null. | ||
/** | ||
* This overload of getTempFilePath takes the directory of type PicardHtsPath instead of String. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might as well use the standard javadoc format like you did elsewhere, i.e. list the params like this (this is a random example I took from elsewhere):
/**
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right; I thought it might be a little verbose to repeat the almost identical javadoc for overladed methods. It seems like using @see
to refer to the original with the full java may be a good solution?
https://stackoverflow.com/questions/2558695/javadoc-reuse-for-overloaded-methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the see
notation seems fine, but please keep in mind that the ultimate goal here is to eliminate these string overloads, since they're the ones that cause problems. (In fact, I would lobby that the next cloudification PR should do exactly that - do nothing except clean up and get rid of all of these methods that take a string path or dir before we accumulate so many usages of them that removing them becomes untenable).
So given that, if we're going to use "see" references like this, it would be preferable to have the public static PicardHtsPath getTempFilePath(final PicardHtsPath directory, String prefix, final String extension)
be the one with specific javadoc, and have the other ones see
/refer to that one. Then when we finally remove the string overloads ones, we won't have to change javadoc anywhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm not mistaken, what we want to avoid is directly manipulating the underlying URI String of an existing PicardHtsPath object to create a new one. Given that getTempFilePath is a method that creates a new PicardHtsPath from scratch, rather than from an existing PicardHtsPath, I think String input can be the main method signature (much like the constructor for PicardHtsPath itself).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't disagree more. The PicardHtsPath
constructor is REQUIRED to take a string arg because thats what the user gives us on the command line, and it's what the CLP/Barclay requires. getTempFilePath(String, String, String)
, on the other hand, is exactly the kind of method that that I think we should seek to get rid of because it FORCES the caller to convert to a string. The WHOLE purpose of HtsPath was to eliminate the problems caused by those conversions. Anyway, I've made these arguments ad nauseum.
outputMap = readOutputMap(OUTPUT_MAP.toPath()); | ||
} else { | ||
outputMap = createOutputMapFromReadGroups(inHeader.getReadGroups(), OUTPUT.toPath(), defaultExtension); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is much better naming, thanks. This tool is super confusing - sometimes OUTPUT_MAP is an output and sometimes its an input.
|
||
/** | ||
* Calls getTempFilePath without the prefix. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment about standard javadoc format with params.
ValidationUtils.validateArg(parentDir.getURIString().endsWith("/"), "parentDir must end in backslash '/': " + parentDir.getURIString()); | ||
return PicardHtsPath.fromPath(PicardBucketUtils.randomRemotePath(parentDir.getURIString(), "", "/")); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we agreed in our discussion yesterday that this method should (and I thought you said that it already did) constrain directory creation to the well-known staging bucket location that has a TTL set. As far as I can tell, it lets you create a directory anywhere.
I know the analogous method in GATK also allows you to create one anywhere as well, but that is how we accumulated the 10,000+ old temporary files floating around in GCS that date back to 2015. If we don't constrain this one, we'll replicate that same problem with Picard. Is there some reason we can't constrain this to the Picard staging location ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now creates a file under a hardcoded staging location "gs://hellbender-test-logs/staging/picard/"
// For local temp file, directory should be null. | ||
/** | ||
* This overload of getTempFilePath takes the directory of type PicardHtsPath instead of String. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the see
notation seems fine, but please keep in mind that the ultimate goal here is to eliminate these string overloads, since they're the ones that cause problems. (In fact, I would lobby that the next cloudification PR should do exactly that - do nothing except clean up and get rid of all of these methods that take a string path or dir before we accumulate so many usages of them that removing them becomes untenable).
So given that, if we're going to use "see" references like this, it would be preferable to have the public static PicardHtsPath getTempFilePath(final PicardHtsPath directory, String prefix, final String extension)
be the one with specific javadoc, and have the other ones see
/refer to that one. Then when we finally remove the string overloads ones, we won't have to change javadoc anywhere else.
e3aebe5
to
9f2b757
Compare
public static PicardHtsPath getRandomGCSDirectory(final String relativePath){ | ||
ValidationUtils.validateArg(relativePath.endsWith("/"), "relativePath must end in backslash '/': " + relativePath); | ||
// This Picard test staging bucket has a TTL of 180 days (DeleteAction with Age = 180) | ||
final String stagingDirectory = "gs://hellbender-test-logs/staging/picard/"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is already a constant defining the staging area in GCloudTestUtils (TEST_STAGING_DEFAULT_GCLOUD_STR
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't access the variable because it's in a test directory. Defined a new constant in this class.
public static PicardHtsPath getTempFilePath(String directory, String extension){ | ||
return getTempFilePath(directory, "", extension); | ||
} | ||
|
||
/** | ||
* | ||
* Creates a temporary file on the local temporary file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is nonsensical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
// For local temp file, directory should be null. | ||
/** | ||
* This overload of getTempFilePath takes the directory of type PicardHtsPath instead of String. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't disagree more. The PicardHtsPath
constructor is REQUIRED to take a string arg because thats what the user gives us on the command line, and it's what the CLP/Barclay requires. getTempFilePath(String, String, String)
, on the other hand, is exactly the kind of method that that I think we should seek to get rid of because it FORCES the caller to convert to a string. The WHOLE purpose of HtsPath was to eliminate the problems caused by those conversions. Anyway, I've made these arguments ad nauseum.
Description
RevertSam now supports cloud input and output.
Checklist (never delete this)
Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.
Content
Review
For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests