Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cloudify RevertSam #1949

Merged
merged 28 commits into from
Aug 1, 2024
Merged

Cloudify RevertSam #1949

merged 28 commits into from
Aug 1, 2024

Conversation

takutosato
Copy link
Contributor

@takutosato takutosato commented Mar 3, 2024

Description

RevertSam now supports cloud input and output.


Checklist (never delete this)

Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.

Content

  • Added or modified tests to cover changes and any new functionality
  • Edited the README / documentation (if applicable)
  • All tests passing on github actions

Review

  • Final thumbs-up from reviewer
  • Rebase, squash and reword as applicable

For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests

@takutosato
Copy link
Contributor Author

@LeeTL1220 @droazen here it is finally.

Copy link
Contributor

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tsato I'm not sure what state this is in, so I started reviewing, but I assume its not ready yet (there are still "tsato" comments, some unused variables, etc.) Let me know if/when its ready for review.

@@ -257,10 +261,12 @@ protected String[] customCommandLineValidation() {

protected int doWork() {
IOUtil.assertFileIsReadable(INPUT.toPath());
ValidationUtil.assertWritable(OUTPUT, OUTPUT_BY_READGROUP);
if (OUTPUT != null && !PicardBucketUtils.isGcsUrl(OUTPUT)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest we add isFileUrl to PicardHstPath, and check for that here, since this won't correctly filter paths on the other (non-GCS) cloud providers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Repeating my comment from the last round: we should be checking to see if this is a "file://" URL before proceeding, rather than filtering out just gcs, so this will work with non-gcs cloud providers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@takutosato
Copy link
Contributor Author

@cmnbroad all the main code should be ready for review. I've been making fixes to the test, in particular how to properly clean up the test output etc so that the output from a previous test doesn't affect the current test. It's a little more challenging when the output paths are specified in a txt file (OUTPUT_MAP), but I should be able to wrap up by the end of the weekend.

Copy link
Contributor

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First round done.

@@ -111,7 +111,7 @@ public Path toPath() {
*/
public static PicardHtsPath fromPath(final Path path){
Objects.requireNonNull(path);
return new PicardHtsPath(new HtsPath(path.toUri().toString()));
return new PicardHtsPath(path.toUri().toString());
Copy link
Contributor

@cmnbroad cmnbroad Mar 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, good catch.

outputMap.put("rg2", new File("/foo/bar/rg2.bam"));
final Map<String, Path> outputMap = new HashMap<>();
outputMap.put("rg1", new File("/foo/bar/rg1.bam").toPath());
outputMap.put("rg2", new File("/foo/bar/rg2.bam").toPath());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and throughout the file, instead of making all of these throw-away File objects, its much cleaner to just do:

Suggested change
outputMap.put("rg2", new File("/foo/bar/rg2.bam").toPath());
outputMap.put("rg2", Paths.get("/foo/bar/rg2.bam"));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, done

@@ -592,4 +597,96 @@ public void testHardClipRoundTrip() throws Exception {
}
}
}

public static final String CURRENT_DIRECTORY = System.getProperty("user.dir");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable is unused and can be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -592,4 +597,96 @@ public void testHardClipRoundTrip() throws Exception {
}
}
}

public static final String CURRENT_DIRECTORY = System.getProperty("user.dir");
public static final String REVERT_SAM_TEST_DATA_DIR = "testdata/picard/sam/RevertSam/";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only used in one place. But it should be at the top of the class, and then it could be used everywhere the string testdata/picard/sam/RevertSam/ appears above, of which there are quite a few.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

public static final String CURRENT_DIRECTORY = System.getProperty("user.dir");
public static final String REVERT_SAM_TEST_DATA_DIR = "testdata/picard/sam/RevertSam/";

final String testSmall = "gs://hellbender/test/resources/picard/bam/CEUTrio.HiSeq.WGS.b37.NA12878.20.21_n100.bam";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable is unused.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed


final String testSmall = "gs://hellbender/test/resources/picard/bam/CEUTrio.HiSeq.WGS.b37.NA12878.20.21_n100.bam";
final String testMedium = "gs://hellbender/test/resources/picard/bam/CEUTrio.HiSeq.WGS.b37.NA12878.20.21_n10000.bam";
final String testMediumCram = "gs://hellbender/test/resources/picard/bam/CEUTrio.HiSeq.WGS.b37.NA12878.20.21_n10000.cram";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and all over the place, since you've made these constants (they are final), they should be static, and have names that are all caps.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if all of these file name constants were PicardHtsPaths instead of Strings, I think the downstream test code would be a lot cleaner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, moved these up to PicardCommandLineProgramTest


// The read groups for these readGroupMaps comes from gs://hellbender/test/resources/picard/bam/CEUTrio.HiSeq.WGS.b37.NA12878.20.21_n10000.bam
final String testReadGroupMapFile = "gs://hellbender/test/resources/picard/revertSam/test_group_map_file.txt";
final String testReadGroupMapFileLocal = REVERT_SAM_TEST_DATA_DIR + "test_group_map_file.txt";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than have the cloud one use an "unsuffixed" name (testReadGroupMapFile), give it a suffix that is symmetric with the local one , i.e. testReadGroupMapFileCloud.

{testMedium, Optional.empty(), OUTPUT_BY_READ_GROUP, Optional.of(testReadGroupMapFile)},
{testMedium, Optional.empty(), OUTPUT_BY_READ_GROUP, Optional.of(testReadGroupMapFileLocal)},
{testMediumCram, Optional.empty(), OUTPUT_BY_READ_GROUP, Optional.of(testReadGroupMapFile)},
{testMediumCram, Optional.of(GCloudTestUtils.TEST_OUTPUT_DEFAULT + "test/reverted.cram"), !OUTPUT_BY_READ_GROUP, Optional.empty()},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think using Optional is buying anything here -its intended to be used as a return value to signal callers that there may be no value, but null would be much simpler in these cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@DataProvider(name="cloudTestData")
public Object[][] getCloudTestData() {
return new Object[][] {
{testMedium, Optional.of(GCloudTestUtils.TEST_OUTPUT_DEFAULT + "test/reverted"), OUTPUT_BY_READ_GROUP, Optional.empty()},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no guarantee that you have exclusive access to these cloud files (i.e., GCloudTestUtils.TEST_OUTPUT_DEFAULT + "test/reverted") when these tests run. In fact, you're pretty much guaranteed to not have exclusive access to them, since many CI test runners can be run at the same time, so this is pretty much guaranteed to result in collisions. This may very likely be the reason you're having problems deleting them in your test code below. (Even if I run theses tests locally while a CI job is running, I think bad things will happen). So we need to find a way to make then unique, probably by dynamically creating a unique output directory or something. I'm not sure if we have infrastructure for that, so we may have to invent something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, as I wrote elsewhere, this did create a race condition. See the updated code, which uses temp files with unique random names and does not recycle file names.

@takutosato
Copy link
Contributor Author

@cmnbroad thanks again for reviewing. Code is ready for another round of review.

public static final String TEST_OUTPUT_DEFAULT_STR = TEST_STAGING_DEFAULT_STR + "picard/";
public static final PicardHtsPath TEST_INPUTS_DEFAULT = new PicardHtsPath(TEST_INPUTS_DEFAULT_STR);
public static final PicardHtsPath TEST_STAGING_DEFAULT = new PicardHtsPath(TEST_STAGING_DEFAULT_STR);
public static final PicardHtsPath TEST_PROJECT_DEFAULT = new PicardHtsPath(TEST_PROJECT_DEFAULT_STR);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this value for ? It appears to be a project name, but it gets rendered (via the unused method getTestProject below), as a URI ? GATK doesn't use it that way. I suspect this more mistaken carryover from GATK - I dont think there should ever have been a PicardHtsPath or URI rendering of it. Unless there is some use for it, I would say it should be removed, along with getTestProject below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be happy to remove it. @lbergelson is it ok if we removed TEST_PROJECT_DEFAULTand getTestProject from GCloudTestUtils.java? Looks like you added it about a year ago.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@takutosato @lbergelson Actually, git blames me, as part of the cloudification of references PR.Either way, it looks wrong to me (having the project name might be ok, but the trailing "/" and rendering it as a URI seems totally wrong to me). If Louis confirms that, lets either fix it or remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh ok I see what you are saying now. I think it's safe to remove.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I will just remove .getURIString() and the trailing "/"

public static final String TEST_STAGING_DEFAULT_STR = "gs://hellbender-test-logs/staging/";
public static final String TEST_PROJECT_DEFAULT_STR = "broad-dsde-dev/";
public static final String TEST_OUTPUT_DEFAULT_STR = TEST_STAGING_DEFAULT_STR + "picard/";
public static final PicardHtsPath TEST_INPUTS_DEFAULT = new PicardHtsPath(TEST_INPUTS_DEFAULT_STR);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are several javadoc references in this file that are broken now - its not clear to me if they're all broken as a result of these changes, but I think some of them are (i.e, {@ value ...} references, but they should be cleaned up.

I don't think you can call a method in the course of an "{@ value....} link, instead I think 'value' needs to be a static primitive (i.e., a String). Alternatively, you can use {@ link...} to provide a navigable link to a method.

if (createOutputMap){
// If these conditions are met, we dynamically create the output map file based on the read groups in inputSAM
final String outputMapDir = outputMapOptions == OutputMapOptions.CLOUD_OUTPUT_MAP ?
DEFAULT_CLOUD_TEST_OUTPUT_DIR.getURIString() :
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks to me like like you'll still get file name collisions if more than one PR is running this tests in CI at the same time. Let me know if I'm misunderstanding something here, but I think you need to use a dynamically generated folder name here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see; the name generation is buried below in the output map generation code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was one more case (OUTPUT_MAP = null, OUTPUT=directory, OUTPUT_BY_READ_GROUP=true) that could potentially cause collisions across PRs. I resolved it by appending a timestamp as prefix in those cases.

@@ -21,11 +23,19 @@ public abstract class CommandLineProgramTest {
public static final File CHR_M_REFERENCE = new File(REFERENCE_TEST_DIR,"chrM.reference.fasta");
public static final File CHR_M_DICT = new File(REFERENCE_TEST_DIR,"chrM.reference.dict");

// These are the hg19 references with chromosome names "1" (rather than "chr1")
public static final PicardHtsPath HG19_REFERENCE_GCLOUD = new PicardHtsPath("gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.fasta");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to be unsed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

@@ -121,6 +126,10 @@ public TabbedTextFileWithHeaderParser(final File file) {
this(new TabbedInputParser(false, file));
}

public TabbedTextFileWithHeaderParser(final Path file) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would try to avoid adding a new public constructor that propagates exceptions like this (throws IOException), especially given that the existing constructors don't. You've already written several call sites with the appropriate try/catch blocks, so it would be preferable to just create the TabbedInputParser directly within those, and then have them delegate to the existing constructor that takes a TabbedTextFileWithHeaderParser. Then this can be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, that makes sense, but I think elsewhere in Picard/htsjdk, we often offer constructors both for File and Path. So it seems less than ideal to have the implicit convention that if you are working with Path, you are expected to use the constructor that takes TabbedInputParser.

What if I put try-catch within TabbedTextFileWithHeaderParser(final Path file), so I can avoid the throws IOException signature?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I can't do this because the this statement needs to be the first line of the constructor…

*/
private PicardHtsPath getTestReadGroupMapFile(final PicardHtsPath sam, final String outputBase) {
final String samExtension = sam.isBam() ? ".bam" : ".cram";
final PicardHtsPath result = PicardBucketUtils.getTempFilePath(DEFAULT_CLOUD_TEST_OUTPUT_DIR.getURIString(), ".txt");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having said that, it is super confusing that this method has an outputBase arg, but the master file is always in the cloud, even when the outputBase is not. Is that intentional ? If so I would rename the method to something that indicates that it is cloud-specific. More likely though, the intention is that it should create the .txt file in the output base ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two paths here. One is the ".txt" file (really tsv), where each row is the read group, and the associated path where that read group SAM is written. This file is a temp file, always living in the cloud (by default)

The outputBase arg is the base path for each of the SAM files listed in the .txt file above. This can be cloud or local.

I renamed the variables to make this a little more clear.

* @return The PicardHtsPath to the tsv with each row a pair (read name, output), where output points to a temp file
o * created in the directory specified in outputBase.
*/
private PicardHtsPath getTestReadGroupMapFile(final PicardHtsPath sam, final String outputBase) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If outputBase must be a directory, I would rename it to outputDirectory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed


@DataProvider(name = "cloudTestData")
public Object[][] getCloudTestData() {
final PicardHtsPath tempCramOutputPath = PicardBucketUtils.getTempFilePath(GCloudTestUtils.TEST_OUTPUT_DEFAULT_STR, "RevertSam", ".cram");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a cloud path, I would include that in the name to make it clear, i.e. tempCloudCram or something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

if (inputSAM.isCram()){
args.add("REFERENCE_SEQUENCE=" + HG19_CHR2021);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reference to use should really be part of the data provider; otherwise its hard to add new test files to this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

*/
@Test(dataProvider = "cloudTestData", groups = "cloud")
public void testCloud(final PicardHtsPath inputSAM, final PicardHtsPath outputPath, final boolean outputByReadGroup,
final OutputMapOptions outputMapOptions) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a while to figure out why all of these args are needed. It's very possible that I'm missing some nuance, but it seems like it all of this (test and provider) code would be a lot simpler if there were just 3 values in the data provider: the input path, the output path (which will either be a file or a directory), and a writeByReadGroup boolean (which tells you whether the output path needs to be a map (OUTPUT_MAP) or a file (OUTPUT)). Everything else in the test method could be derived from those, and then you don't need the enum.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be, but I think it's a bit dangerous to rely on Files.isDirectory(OUTPUT)given how directories are handled in GCS. And even if we did that, I don't think the code would be that much simpler. The reason why this code is hard to parse, I think, is because RevertSam accepts a wide array of input patterns (e.g. OUTPUT may be a file or directory), so I vote that we leave it as is. What do you think?

@takutosato
Copy link
Contributor Author

@cmnbroad sorry this took a while. I've addressed all your comments.

Copy link
Contributor

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking much better - just a couple of minor things left, plus the remote file name collision issue (see inline comments and the engine team slack discussion).

public static final String TEST_PROJECT_DEFAULT_STR = "broad-dsde-dev";
public static final String TEST_OUTPUT_DEFAULT_STR = TEST_STAGING_DEFAULT_STR + "picard/";
public static final PicardHtsPath TEST_INPUTS_DEFAULT = new PicardHtsPath(TEST_INPUTS_DEFAULT_STR);
public static final PicardHtsPath TEST_STAGING_DEFAULT = new PicardHtsPath(TEST_STAGING_DEFAULT_STR);
public static final PicardHtsPath TEST_OUTPUT_DEFAULT = PicardHtsPath.resolve(TEST_STAGING_DEFAULT, "picard/");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a (currently unreferenced) constant TEST_OUTPUT_DEFAULT_STR above. This TEST_OUTPUT_DEFAULT constant should either reference that, or the TEST_OUTPUT_DEFAULT_STR constant should be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

public static PicardHtsPath getTempFilePath(final PicardHtsPath directory, String prefix, final String extension){
return getTempFilePath(directory.getURIString(), prefix, extension);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 2 new getTempFilePath overloads need javadoc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, but let me know if you were hoping for a more detailed javadoc

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These imports are all unused now, and should be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -700,7 +702,7 @@ public void testUnmapBacterialContamination() throws IOException {
MergeBamAlignment mergeBamAlignment = new MergeBamAlignment();

mergeBamAlignment.ALIGNED_BAM = Collections.singletonList(file.toPath().toFile()); // TODO update to use Path when MergeBamAlignment is updated to use Path
mergeBamAlignment.UNMAPPED_BAM = fileUnaligned;
mergeBamAlignment.UNMAPPED_BAM = fileUnaligned.toPath().toFile(); // Also need to be updated to use Path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in this file seem to be completely unnecessary. Although I suppose they're harmless,it seems like its. just churn since they don't seem to add anything (maybe there is some previous commit for which they were relevant ?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fileUnaligned is now PicardHtsPath, but MergeBamAlignment hasn't be cloudified yet. I have another PR coming soon for that.

public static final String TEST_STAGING_DEFAULT_STR = "gs://hellbender-test-logs/staging/";
public static final String TEST_PROJECT_DEFAULT_STR = "broad-dsde-dev";
public static final String TEST_OUTPUT_DEFAULT_STR = TEST_STAGING_DEFAULT_STR + "picard/";
public static final PicardHtsPath TEST_INPUTS_DEFAULT = new PicardHtsPath(TEST_INPUTS_DEFAULT_STR);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know some of these were just moved, but in some places, you used a _CLOUD or _GCLOUD in constant names like these, which nicely reinforces that they're remote - I'd suggest doing that for these as.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

public static final PicardHtsPath NA12878_MINI = new PicardHtsPath(GCloudTestUtils.getTestInputPath() + "picard/bam/CEUTrio.HiSeq.WGS.b37.NA12878.20.21_n100.bam");
public static final PicardHtsPath NA12878_MINI_CRAM = new PicardHtsPath(GCloudTestUtils.getTestInputPath() + "picard/bam/CEUTrio.HiSeq.WGS.b37.NA12878.20.21_n100.cram");
public static final PicardHtsPath NA12878_MEDIUM = new PicardHtsPath(GCloudTestUtils.getTestInputPath() + "picard/bam/CEUTrio.HiSeq.WGS.b37.NA12878.20.21_n10000.bam");
public static final PicardHtsPath NA12878_MEDIUM_CRAM = new PicardHtsPath(GCloudTestUtils.getTestInputPath() + "picard/bam/CEUTrio.HiSeq.WGS.b37.NA12878.20.21_n10000.cram");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know some of these names were just moved, but in some places, you used a _CLOUD (or _GCLOUD) in the names of constants like these, which nicely reinforces that they're remote - I'd suggest doing that for these as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

LocalDateTime now = LocalDateTime.now();
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd_HH:mm:ss.SSS");
prefix = now.format(formatter);
args.add("PREFIX=" + prefix);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, I think this name collision resolution strategy is a step backwards.

Instead of using DEFAULT_CLOUD_TEST_OUTPUT_DIR for the output directory in these tests, it would be much simpler to change the test code to create a new temporary directory/bucket in the staging dir, and pass that to RevertSam as the outputPath. Then you don't need to add the new tool arg PREFIX, which is really only useful for tests, or this prefix/timestamp code (which is still theoretically subject to name collision).

If there are test cases for which this (or copying the inputs - see the discussion in the engine team space) doesn't suffice for some reason, let me know and we can think about other alternatives. But I'd very much prefer to not add args to tools that are only useful for tests if we can possibly avoid it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that would be the most natural thing to do, except that there are no directories in google cloud so we can't "create" a temp directory…what's confusing is that in the browser, they do let you create a directory (that's how I created DEFAULT_CLOUD_TEST_OUTPUT_DIR). For instance createDirectory method in CloudStorageFileSystemProvider.java is a no-op.

Come to think of it, as long as we don't check that the output directory exists (which we currently do not when the directory is in the cloud), I could probably just create an nonexistent random directory name as OUTPUT, and the files will be created under it. I will do some digging and let you know.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tsato - you can create a cloud directory (see the engine-team slack discussion from Tuesday 7/9 about this). Also, I'm currently doing a similar thing myself in a different PR so I can share that code with you if you like.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if my approach in the new commit is what you had in mind.

final PicardHtsPath tempLocalCram = PicardBucketUtils.getLocalTempFilePath("RevertSam", ".cram");
return new Object[][]{
// Output by read group without the output map, write output bams in the cloud.
{NA12878_MEDIUM, DEFAULT_CLOUD_TEST_OUTPUT_DIR, OUTPUT_BY_READ_GROUP, OutputMapOptions.NO_OUTPUT_MAP, null },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this test case, I think you can just create a temporary parent bucket in the staging area to to hold the output, rather than referencing the default staging bucket here and trying to synthesize unique names using a prefix and timestamp (see my comment where the name synthesis code is).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I addressed this issue by "creating" a "directory" with a randomly generated path. See PicardBucketUtils::getRandomGCSDirectory()

@takutosato
Copy link
Contributor Author

@cmnbroad I addressed your comments, removed the "prefix" with a timestamp approach for avoiding race conditions, and I also removed the enum you suspected earlier was unnecessary. Back to you.

Copy link
Contributor

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah - this always takes longer than I think it will. I did a quick pass and there are only a couple of very minor things left. Also, we need to discuss the auto-delete configuration for the temporary directories with Louis. I think we might need to enforce that getRandomGCSDirectory always creates a directory in a well-known specific gcs object that is set to auto-delete, rather than letting the caller create one anywhere. I'll bring this up for discussion with Louis on Monday at the team meeting, so lets hold off on merging until then.

/**
* Creates a path to a "directory" on a Google Cloud System filesystem with a randomly generated URI.
* Since the notion of directories does not exist in GCS, it creates a path to a URI ending in "/".
* Calling this method will not create a directory/file on GCS. It merely returns a path to a non-directory.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is super-confusing:

It merely returns a path to a non-directory.

since this method is supposed to be creating a directory. I would suggest maybe saying "it creates an object that can be used as a parent container for other objects" or something similar instead.

Copy link
Contributor Author

@takutosato takutosato Jul 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I rewrote this to make it clear

// For local temp file, directory should be null.
/**
* This overload of getTempFilePath takes the directory of type PicardHtsPath instead of String.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well use the standard javadoc format like you did elsewhere, i.e. list the params like this (this is a random example I took from elsewhere):

/**

  • Picks a random name, by putting some random letters between "prefix" and "suffix".
  • @param stagingLocation The folder where you want the file to be. Must start with "gs://" or "hdfs://"
  • @param prefix The beginning of the file name
  • @param suffix The end of the file name, e.g. ".tmp"
    */

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right; I thought it might be a little verbose to repeat the almost identical javadoc for overladed methods. It seems like using @see to refer to the original with the full java may be a good solution?

https://stackoverflow.com/questions/2558695/javadoc-reuse-for-overloaded-methods

Copy link
Contributor

@cmnbroad cmnbroad Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the see notation seems fine, but please keep in mind that the ultimate goal here is to eliminate these string overloads, since they're the ones that cause problems. (In fact, I would lobby that the next cloudification PR should do exactly that - do nothing except clean up and get rid of all of these methods that take a string path or dir before we accumulate so many usages of them that removing them becomes untenable).

So given that, if we're going to use "see" references like this, it would be preferable to have the public static PicardHtsPath getTempFilePath(final PicardHtsPath directory, String prefix, final String extension) be the one with specific javadoc, and have the other ones see/refer to that one. Then when we finally remove the string overloads ones, we won't have to change javadoc anywhere else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not mistaken, what we want to avoid is directly manipulating the underlying URI String of an existing PicardHtsPath object to create a new one. Given that getTempFilePath is a method that creates a new PicardHtsPath from scratch, rather than from an existing PicardHtsPath, I think String input can be the main method signature (much like the constructor for PicardHtsPath itself).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't disagree more. The PicardHtsPath constructor is REQUIRED to take a string arg because thats what the user gives us on the command line, and it's what the CLP/Barclay requires. getTempFilePath(String, String, String), on the other hand, is exactly the kind of method that that I think we should seek to get rid of because it FORCES the caller to convert to a string. The WHOLE purpose of HtsPath was to eliminate the problems caused by those conversions. Anyway, I've made these arguments ad nauseum.

outputMap = readOutputMap(OUTPUT_MAP.toPath());
} else {
outputMap = createOutputMapFromReadGroups(inHeader.getReadGroups(), OUTPUT.toPath(), defaultExtension);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much better naming, thanks. This tool is super confusing - sometimes OUTPUT_MAP is an output and sometimes its an input.


/**
* Calls getTempFilePath without the prefix.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about standard javadoc format with params.

ValidationUtils.validateArg(parentDir.getURIString().endsWith("/"), "parentDir must end in backslash '/': " + parentDir.getURIString());
return PicardHtsPath.fromPath(PicardBucketUtils.randomRemotePath(parentDir.getURIString(), "", "/"));
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we agreed in our discussion yesterday that this method should (and I thought you said that it already did) constrain directory creation to the well-known staging bucket location that has a TTL set. As far as I can tell, it lets you create a directory anywhere.

I know the analogous method in GATK also allows you to create one anywhere as well, but that is how we accumulated the 10,000+ old temporary files floating around in GCS that date back to 2015. If we don't constrain this one, we'll replicate that same problem with Picard. Is there some reason we can't constrain this to the Picard staging location ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now creates a file under a hardcoded staging location "gs://hellbender-test-logs/staging/picard/"

// For local temp file, directory should be null.
/**
* This overload of getTempFilePath takes the directory of type PicardHtsPath instead of String.
*/
Copy link
Contributor

@cmnbroad cmnbroad Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the see notation seems fine, but please keep in mind that the ultimate goal here is to eliminate these string overloads, since they're the ones that cause problems. (In fact, I would lobby that the next cloudification PR should do exactly that - do nothing except clean up and get rid of all of these methods that take a string path or dir before we accumulate so many usages of them that removing them becomes untenable).

So given that, if we're going to use "see" references like this, it would be preferable to have the public static PicardHtsPath getTempFilePath(final PicardHtsPath directory, String prefix, final String extension) be the one with specific javadoc, and have the other ones see/refer to that one. Then when we finally remove the string overloads ones, we won't have to change javadoc anywhere else.

public static PicardHtsPath getRandomGCSDirectory(final String relativePath){
ValidationUtils.validateArg(relativePath.endsWith("/"), "relativePath must end in backslash '/': " + relativePath);
// This Picard test staging bucket has a TTL of 180 days (DeleteAction with Age = 180)
final String stagingDirectory = "gs://hellbender-test-logs/staging/picard/";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already a constant defining the staging area in GCloudTestUtils (TEST_STAGING_DEFAULT_GCLOUD_STR).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't access the variable because it's in a test directory. Defined a new constant in this class.

public static PicardHtsPath getTempFilePath(String directory, String extension){
return getTempFilePath(directory, "", extension);
}

/**
*
* Creates a temporary file on the local temporary file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nonsensical.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

// For local temp file, directory should be null.
/**
* This overload of getTempFilePath takes the directory of type PicardHtsPath instead of String.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't disagree more. The PicardHtsPath constructor is REQUIRED to take a string arg because thats what the user gives us on the command line, and it's what the CLP/Barclay requires. getTempFilePath(String, String, String), on the other hand, is exactly the kind of method that that I think we should seek to get rid of because it FORCES the caller to convert to a string. The WHOLE purpose of HtsPath was to eliminate the problems caused by those conversions. Anyway, I've made these arguments ad nauseum.

@takutosato takutosato merged commit 77f4b09 into master Aug 1, 2024
9 checks passed
@takutosato takutosato deleted the ts_revertsam_lee branch August 1, 2024 18:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants