Finalize workflow output definition #5103
Comments
Great suggestions, Ben. Do you expect any changes to the observable events in this new channel-focused model? A common use case for plugins is to register the outputs with a third-party service (a LIMS, for example). It would be helpful if plugins could hook into a publish event that carries the source, the destination, and other channel objects (i.e. If we are now "publishing channels", it would be helpful to have a way to include everything, including non-file data, in that publishing event.
Great to see this coming together after previous discussions, Ben! Really like the The distinction between 'output' and 'publishing' never sat well with me; it made much more sense to me if everything was an 'output', with things marked temporary/retained as appropriate. Plus it is a nightmare searching across multiple config files for the publishing logic for a specific set of files (see the current dev branch of nf-core/rnaseq). So I really like all the places where you're removing publishing logic, and it makes a LOT of sense to me that the publishing/retaining part happens only at the outermost layer. I hope you keep that. Really like having entry workflows as well; having all components be more easily runnable without as much surrounding boilerplate would be awesome. Would also be interested to hear more on the non-file output aspects, speaking to Rob's point, and how it could assist with daisy-chaining workflows, but overall love the way this is going!
Points I'd like to discuss:
|
From @Midnighter in #5130 :
|
Thanks @Midnighter, I have basically reached the same conclusions. After discussing with Paolo, I think we will encourage |
Hey! If I can add something from my wishlist: it would be great to get logging on what dynamic output paths like |
Any news on when the dynamic path mapping will be in a nextflow edge release? I'm currently trying to convert my file publishing to the new output definitions but need this feature to be able to copy the old way |
@nvnieuwk Still in progress. I have an implementation in the linked PR but it needs to be reviewed and tested. You're welcome to try it out if you want. I doubt it will make the next edge release, maybe 24.09.0-edge. |
I'd like to get some opinions on some of the nuances of the dynamic path mapping. If I have something like this:

    workflow {
        main:
        ch_fastq = Channel.of( [ [id: 1], [ file('foo.fastq') ] ] )

        publish:
        ch_fastq >> 'fastq'
    }

    output {
        directory 'results'

        'fastq' {
            path { /* ... */ }
        }
    }

I see a few different ways to design the
I like (2) because it's simpler and retains the meaning of the Option (1) is essentially the same as Let me know what you think. This is the essential question we need to resolve in order to move this feature forward. |
I personally like option 1 because that way I don't have to make sure the files emitted by processes have the correct name as I will set the correct output name here. It's also a nice option to be able to specify some kind of optional nesting which option 1 would enable. |
During my testing I also noticed a weird error. So when an input file that hasn't been staged is present in the channel specified in the
It started working again if I filtered out these paths |
Two more syntax ideas from @robsyme and @pinin4fjords respectively
|
Thanks @bentsherman! I like 4) because, while it forces me to do a transpose from time to time, I know that if I make my channels fit a predefined schema, I'm good. I guess we'd probably just end up doing a lot of:
? (Although, if it's a case of doing that all the time, maybe there could be something implicit somewhere to do that?) In any case, predictable structure of publishable channels means that all my To take this to the nth degree, if things were predictable enough, maybe there could be some default behaviour so we didn't even need
... most of the time. There could be a default way in which files were structured in the publish directories based on metadata properties. You'd only need the closure if you wanted to depart from those defaults. |
What if, for people that wanted an index file, the pattern 3 could be extended to include something close to what @pinin4fjords is suggesting. The user could return two elements, where the first element is the metadata and the second element is the destination filename:

    path { meta, fastqs ->
        { file ->
            destination = "${meta.id}/${file.baseName}"
            [meta.subMap('id', 'patient'), destination]
        }
    }

The index could simply use the first element (if provided) to populate the columns in the csv/tsv/whatever. This would remove the necessity for the channel transforms in the workflow. If the transforms are always going to be simple unwrapping/transposing, I think this approach would be tidier. If we expect transforms to be more involved, then this proposal would not be suitable.
The samplesheet produced by fetchngs contains a row for each sample, where each sample has potentially four files (fastq pair + md5 pair) associated with it. This example suggests that the publishing is file-centric whereas the index file is sample-centric, so we can't necessarily couple these things as a shortcut. For the In other words, the published channel needs to correspond exactly to the rows of the index file, basically option (3). I think the appeal of option (4) is to not have so much code in the output block, and to make the Need to see if there is a way to recover those properties with option (3). But I suspect the "shortcut" is to not use dynamic paths at all.

    workflow {
        main:
        ch_fastq = Channel.of( [ [:], file('1.fastq'), file('2.fastq') ] )

        publish:
        ch_fastq >> 'fastq'
    }

    output {
        fastq {
            // default: publish everything to 'fastq/'
            // no path option needed

            // dynamic path (if you want it)
            path { meta, fastq_1, fastq_2 ->
                { file -> "fastq/${meta.id}/${file.baseName}" }
            }

            // should just work
            index {
                path 'samplesheet.csv'
            }
        }
    }

I think the main shortcut I was imagining is that the index file should just work with the conventional meta + file(s) structure as shown above, without any extra closure logic. It is a bit tricky to infer the column names for the files, but still doable, and it would become much simpler with record types. But the path closure seems unavoidable if you want a dynamic path, because there is no guaranteed 1-to-1 mapping of channel value to file.
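To make the "index file should just work" idea a bit more concrete, here is a rough sketch in plain Groovy of how rows could be derived from the conventional meta + file(s) structure. Everything here is hypothetical: `toIndexCsv` and the `file_1`, `file_2`, ... column names are made up for illustration and are not part of Nextflow, and files are represented as plain strings.

```groovy
// Hypothetical sketch, not Nextflow's implementation.
// Each value is expected to look like [meta, file1, file2, ...].
List<String> toIndexCsv(List<List> values) {
    // Columns for metadata: union of all meta keys, in order of first appearance
    def metaKeys = values.collectMany { it[0].keySet() }.unique()
    // Columns for files: as many as the widest channel value needs
    int fileCount = values.collect { it.size() - 1 }.max() ?: 0
    // The file column names are an assumption; nothing in the channel value names them
    def header = metaKeys + (0..<fileCount).collect { "file_${it + 1}".toString() }

    def rows = values.collect { value ->
        def meta = value[0]
        def files = value.tail()
        def cells = metaKeys.collect { meta.containsKey(it) ? meta[it] : '' } +
                    (0..<fileCount).collect { i -> i < files.size() ? files[i] : '' }
        cells.join(',')
    }
    return [header.join(',')] + rows
}

// One row per channel value, i.e. per sample rather than per file
toIndexCsv([
    [ [id: 's1'], '1.fastq', '2.fastq' ],
    [ [id: 's2'], '3.fastq' ]
]).each { println it }
```

The awkward part is visible in the header: the metadata columns have natural names, while the file columns have to be invented, which is where record types would help.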
Another nice thing I realized about the double closure is that it could also support option (2) above, by returning a path instead of an inner closure. That would correspond to publishing each file from the channel into a directory without modifying the base file name:

    path { meta, fastq_1, fastq_2 ->
        "fastq/${meta.id}"
    }

So there can be multiple levels of dynamism depending on how specific you want to be.
The nested closure is very confusing. It's hard to understand when to use more than one argument and when to use a single one. I think the problem could be reduced to having a closure with two arguments, one for the context and a second for the file path. The context could be by definition the first one.
Keep in mind that the double closure is the most extreme form of dynamic path and the least likely to be used. Doing something like

    output {
        // many levels of dynamism
        fastq {
            // default: publish to 'fastq/'
            // nothing

            // publish to a different static path
            path 'samples'

            // publish to subdirectories by sample id
            path { meta, fastq_1, fastq_2 ->
                "fastq/${meta.id}"
            }

            // publish each file to its own path (extreme)
            path { meta, fastq_1, fastq_2 ->
                { file -> "fastq/${meta.id}/${file.baseName}" }
            }
        }
    }
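For anyone trying to picture how these levels would be told apart at publish time, here is a minimal sketch in plain Groovy. It is an assumption about how such a resolver could work, not the code in the linked PR: `resolveDestination` is a hypothetical helper, and files are represented by their names as strings so the snippet runs without Nextflow.

```groovy
// Hypothetical resolver, not Nextflow's implementation.
// target:     name of the publish target, e.g. 'fastq'
// pathOption: null, a static path, or a closure over the channel value
// value:      the channel value, e.g. [meta, fastq_1, fastq_2]
// fileName:   name of the file being published
String resolveDestination(String target, Object pathOption, List value, String fileName) {
    // Level 0: no path option, publish under a directory named after the target
    if (pathOption == null)
        return "${target}/${fileName}"
    // Level 1: static path, used as the directory
    if (pathOption instanceof CharSequence)
        return "${pathOption}/${fileName}"
    // Levels 2 and 3: evaluate the closure against the channel value
    def result = ((Closure) pathOption).call(*value)
    if (result instanceof Closure) {
        // Level 3: the inner closure maps each file to its full destination
        return result.call(fileName).toString()
    }
    // Level 2: the closure returned a directory, the file name is kept
    return "${result}/${fileName}"
}

def value = [ [id: 's1'], 'a.fastq', 'b.fastq' ]
def bySample = { meta, fastq_1, fastq_2 -> "fastq/${meta.id}" }
println resolveDestination('fastq', bySample, value, 'a.fastq')
```

The only thing that separates the last two levels is the type of the value returned by the outer closure, which is why a path and an inner closure can share the same `path` option.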
@robsyme I've been thinking about your point about updating the publish events. Now that I've taken another pass through everything, I see now that it would actually be quite simple to do what you suggested. We could add a new event e.g. We can keep the existing
So you can listen to either event or both based on what you need. I was worried at first because I was thinking about trying to attach metadata to each individual file, but I like this separate event much better |
You superstar. Thanks Ben!!! |
Thanks @bentsherman. Three quick questions
|
Yes, it should just receive a single argument corresponding to the channel value:

    void onWorkflowPublish(Object value) {
        def (meta, fastqs) = [ value[0], value[1] ]
        // ...
    }
I guess the tricky part here is that this event hook is receiving every value from every published channel, so you'll have to do some pattern matching if you want to dig deeper into things like metadata. That's actually pretty wild from the plugin's perspective: you're receiving arbitrary values that depend entirely on the pipeline you're running. Of course you could optimize around a few common use cases like "if it's a list, and the first element is a map, etc ..." We might want to think more critically about this interface.
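As an illustration of the "if it's a list, and the first element is a map" kind of pattern matching, a plugin could normalize whatever it receives before doing anything else. This is only a sketch in plain Groovy: `splitPublishedValue` is a hypothetical helper, and the meta-map-first layout is a convention, not something the event guarantees.

```groovy
// Hypothetical plugin-side helper: splits an arbitrary published value into
// metadata and "everything else", falling back gracefully when the value
// does not follow the meta-map-first convention.
Map splitPublishedValue(Object value) {
    // Common case: [meta, file(s)...] with a map in first position
    if (value instanceof List && value && value[0] instanceof Map)
        return [meta: value[0], rest: value.tail()]
    // A list without metadata: keep all elements, no metadata
    if (value instanceof List)
        return [meta: [:], rest: value]
    // A bare value (single file, number, string, ...)
    return [meta: [:], rest: [value]]
}

println splitPublishedValue([ [id: 1], 'foo.fastq' ])
println splitPublishedValue('summary.txt')
```

Anything more specific than this (which elements are files, what the metadata keys mean) still depends on the pipeline being run.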
You make an excellent point. The lack of specificity would be challenging. Given the interface

    workflow {
        main:
        ch_foo = Channel.of( [ [id: 1], file('foo.fastq') ] )

        publish:
        ch_foo >> 'foo'
    }

It would be helpful if the

    void onWorkflowPublish(String channelTag, Object value) {
        // ...
    }
I had a similar thought. The target name might be nice to have just for record keeping, but it doesn't reveal anything new about the expected structure of the value. It's not entirely hopeless though. If you look at A plugin basically has two options:
Both seem kinda sketchy |
The unknowability of the exact structure of the objects in channels doesn't worry me too much. Nextflow (and any plugin) should dutifully serialize the object. We already impose one important limitation on objects in channels (or at least objects that support |
I like the idea of requiring it to be serializable into some kind of format. That would be a good thing to explore with your plugin. I'll try to at least get this workflow event merged in the second preview. Is CBOR not space efficient compared to Kryo? |
CBOR and Kryo will both produce relatively efficient, compressed forms. JSON serialization will take more space than the other two, but space isn't really a concern because:
|
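To give a feel for the JSON route discussed above, here is a small sketch in plain Groovy using `groovy.json.JsonOutput`. The `normalize` and `toJsonEvent` helpers and the `target`/`value` field names are hypothetical. Paths are converted to strings first, on the assumption that a plugin would rather record the location than have a generic serializer try to walk a `Path` object.

```groovy
import groovy.json.JsonOutput
import java.nio.file.Path
import java.nio.file.Paths

// Hypothetical helper: replaces paths with plain strings and recurses into
// maps and collections, so the result contains only JSON-friendly types.
def normalize(Object value) {
    if (value instanceof Path || value instanceof File)
        return value.toString()
    if (value instanceof Map)
        return value.collectEntries { k, v -> [(k.toString()): normalize(v)] }
    if (value instanceof Collection)
        return value.collect { normalize(it) }
    return value
}

// Hypothetical event payload: the target name plus the normalized channel value
String toJsonEvent(String target, Object value) {
    return JsonOutput.toJson([target: target, value: normalize(value)])
}

println toJsonEvent('foo', [ [id: 1], Paths.get('foo.fastq') ])
```

The same normalized structure could be handed to a CBOR or Kryo encoder instead; only the last step changes.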
Hi! I don't want to hijack this thread, but I did have a couple questions related to #2844 and #5185 and I'm hoping this is an appropriate place. With the new output style, I really like the creation of an index file that contains file name/output path information. I see this being helpful in a variety of scenarios. I am specifically interested in generating md5sum hashes of my output files. This could be included in the index as an additional column per-file if enabled via the path directive? I also saw a draft about enabling E-tag aware files for Azure and AWS? I know that E-tags on AWS can vary based on the chunk/part size during upload, so I am concerned about assuming this value is always the valid hash. From what I saw it looked like the hash would be calculated first using a standard implementation, but is it validated that this matches the E-tag? Currently, with Nextflow 24.04, my plan to achieve this behavior using |
A simple solution would be to define a function that computes the md5 of a file:

    // Path -> String
    def md5(file) {
        // ...
    }

Then you could compute the md5 whenever you want and append it to the channel value:

    ch_fastq.map { meta, fastq_1, fastq_2 ->
        [ meta, fastq_1, fastq_2, md5(fastq_1), md5(fastq_2) ]
    }

A more efficient solution would be to compute the md5 while you already have the data locally, as it is done here: https://github.com/nf-core/fetchngs/blob/c60d09be3f73376156471d6ed78fcad646736c2f/modules/local/aspera_cli/main.nf#L26-L33 Either way, you have to compute these checksums yourself if you want them to be in the index file, because you are responsible for defining the schema and columns, etc using the
Note, I think we could provide |
whispers choose https://www.blake2.net/ |
More loudly choose BLAKE3 |
Continuation of #4670
Enumerating the proposed changes that we've collected so far:
Support additional index file formats (json, yaml)
Generate output schema. Essentially a list of index file schemas. Should be generated on each run or via some separate command. Should eventually be used with parameter schema for chaining pipelines. See DSL2+ nf-core/fetchngs#312 for a concrete example (-> Generate output schema from output definition #5213 schema_outputs.yml).
Dynamic path mapping. Allow the path option in target definition to be a closure:
Note that the dynamic path need only define the directory, not the full filename. Since a channel value may contain multiple files, an alternative syntax could be to provide the file and the channel value to the closure, so that it's clear which file is being published:
Move publish options to config. Publish options like directory and mode typically need to be configurable by the user, which currently would require you to define a special param for each option. Therefore it makes more sense for them to be config settings rather than pipeline code:
The output block should be used only to define index files (i.e. the output schema). In other words, the pipeline code should define what is published and the config should define how it is published.
For the output directory, it has also been proposed to provide a CLI option for it e.g. -output-dir and shorter config option e.g. outputDir. The output directory would be available in the pipeline code as part of workflow metadata i.e. workflow.outputDir.
Remove publish section from process definition. Still under discussion. The rationale for the process publish section was to provide some sensible defaults which can be overridden, however I've come to think that it only makes it harder to determine what is being published. Instead of enumerating the publish targets in one place, they are scattered throughout the pipeline code. Also, process definitions are abstracted from any particular pipeline, so it doesn't make much sense to put pipeline-specific details like params and publishing in the process definition.
A better way to give some sensible defaults for a process would be to write an entry workflow in the process definition that gives an example usage:
This workflow will be ignored when importing the process as a module, but it provides a concrete example and can even be used to run the process directly. In fact it could even replace the custom nf-test DSL eventually.
Allow publish section only in entry workflow. I am less certain about this one but wanted to lay out the case for it. Building on the previous item, having publish sections potentially spread across several workflows in a pipeline makes it hard to see what all is being published. Instead, maybe named workflows should only be allowed to emit outputs, and only the entry workflow be able to publish outputs. As with the previous point, you could write an entry workflow for each named workflow which gives some example publishing (and allow you to run the workflow as a standalone pipeline).
This idea is inspired by a principle in software engineering that side effects (a.k.a. I/O, publishing) should be pushed to the "boundaries" of the code, to make it more readable and testable, and to make it easier to swap out different I/O strategies (file copy, database insert, API call, etc).
At the same time, I appreciate that publishing from a named workflow is a convenient shorthand, especially when you consider having to propagate outputs back up through potentially several nested workflows. But I wonder if being more strict here would be better in the long run. The example entry workflow is something that will be written anyway, both for testing and to run workflows as standalone pipelines.
Runtime enhancements
Include output targets in inspect command. Similar to how inspect lists all processes with some resolved directives, etc, it could also show the resolved list of output targets. Not as essential if we implement some of the above points, but still useful for things like resolving params.