Docs request: Fetching remote files #5493

Closed
trev-f opened this issue Nov 11, 2024 · 6 comments · Fixed by #5523

trev-f commented Nov 11, 2024

New feature (docs)

I would like to request documentation describing how remote files are downloaded/staged in Nextflow.

Usage scenario

Projects that require fetching large amounts of data from remote sources are common, and it's necessary to fetch those files in an efficient manner. While Nextflow makes it easy to download remote files, the lack of documentation on how remote files are handled makes it difficult to evaluate when to fetch files with this built-in Nextflow option versus building a more tailored solution.

Currently, the lack of documentation makes it difficult to build a mental model for how downloading remote files works in Nextflow. Since fetching remote data can be a massive bottleneck for some projects, it's imperative that users understand how Nextflow works so that we can build more efficient workflows.

Suggest implementation

In the remote files docs, answer some basic questions about how remote files are handled, such as:

  • What triggers the download of a remote file? When the file() method is called on a string that resembles a path? When a Channel is created from a Path object? When a Path object inside a Channel is accessed?
  • What is the default storage location for fetched files?
  • Are remote files cached, or are they fetched again for each run of a pipeline? Does -resume affect this behavior?
  • Is there a way to publish downloaded files to a specified location, or do pipeline developers need to write bespoke solutions using the methods defined for Path objects?
  • What Nextflow job is responsible for downloading files? Is this done in the main job?
  • How are remote files actually fetched? What packages/softwares/tools in Groovy are used to perform the download?

bentsherman commented Nov 13, 2024

@trev-f To answer your immediate questions:

  • Remote file download is triggered when a task is created with an input file that does not reside on the same filesystem as the task work directory.

  • Remote files are staged into the work directory, in a special subdirectory of the form stage-<hash>. We need to consider whether it's worth documenting the components of that hash.

  • Remote files are cached on a best-effort basis using the aforementioned hash. Of course, if the same remote file is requested by multiple tasks at the same time, they will likely each download a separate copy to separate folders.

  • If you don't want to rely on the built-in remote file staging, you can write a custom process to download the file into a task directory. Make sure to provide the file name as a val input instead of a path input so that it isn't staged by Nextflow.

  • Nextflow itself downloads these files. As you can imagine, this doesn't always scale well, which is why we generally recommend using S3 for both the inputs and the work directory, together with Fusion, so that the tasks can stage the input files transparently from either location.

  • We try to use standard libraries as much as possible. For HTTP/FTP we use HttpURLConnection and FtpURLConnection, for S3 we use the AWS Java SDK, etc. You can look at the various implementations of FileSystemProvider in the Nextflow codebase for details.
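
To help build a mental model of the first two points, here is a minimal Python sketch of the staging decision and the stage-<hash> layout. It is not Nextflow's code. The function names are hypothetical, "same filesystem" is approximated by comparing URI schemes only, and the hash is computed from the URI alone, since the real hash components are not given in this thread.

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import urlparse


def filesystem_of(location: str) -> str:
    """Return a label for the filesystem that a URI or POSIX path lives on."""
    # Plain paths such as /data/work have no scheme; treat them as local files.
    return urlparse(location).scheme or "file"


def needs_staging(input_uri: str, work_dir: str) -> bool:
    """True when the input is on a different filesystem than the work directory.

    ASSUMPTION: comparing schemes is only an approximation of the real check.
    """
    return filesystem_of(input_uri) != filesystem_of(work_dir)


def stage_path(input_uri: str, work_dir: str) -> str:
    """Hypothetical staging location: <work_dir>/stage-<hash>/<file name>."""
    # ASSUMPTION: the hash covers only the URI. The real components are unknown.
    digest = hashlib.sha256(input_uri.encode()).hexdigest()[:12]
    name = PurePosixPath(urlparse(input_uri).path).name
    return f"{work_dir.rstrip('/')}/stage-{digest}/{name}"
```

Under these assumptions, an https:// input with a local work directory would be staged, while an s3:// input with an s3:// work directory would not. Because the hypothetical hash is deterministic, repeated requests for the same URI map to the same staging path, which is what makes reuse possible.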

@christopher-hakkaart I think we can add a section under Working with files > Remote files, what do you think? You can try a first draft if you want, but I might need to do it myself because I need to check a few details in the code. In any case, this would be a great thing to document, as it is a mystery to many users and unfortunately doesn't rise to the level of something that just always magically works.

@christopher-hakkaart (Contributor)

Hi both, I'll write a draft and link the issue for feedback/corrections.

@bentsherman (Member)

Sounds good. Once you have a first draft, I can add some details as needed.

christopher-hakkaart added a commit to christopher-hakkaart/nextflow that referenced this issue Nov 19, 2024

zihhuafang commented Nov 26, 2024

Hi!
I would like to follow up on the best practices for handling scenarios where multiple tasks request the same remote file simultaneously. Sorry, I am not too familiar with Fusion. I have a hybrid workflow, where some processes run locally and some run on Google Batch, all using the same remote file. If Fusion is enabled, will the same remote file still be downloaded multiple times for the processes running locally? Since downloading from cloud storage incurs costs, ideally the file should only be downloaded once in such cases.

@bentsherman (Member)

I think the shared file would still be downloaded multiple times, because with Fusion each task is responsible for downloading its inputs. For cloud tasks this is fine, because each task has its own VM and needs to download the input files anyway. For local tasks using Fusion it is suboptimal, because in theory the local tasks could cache and reuse the same input file. We just don't have a mechanism to do that with Fusion, as far as I know.

bentsherman linked a pull request Dec 12, 2024 that will close this issue

@bentsherman (Member)

Slight correction to my original response: it looks like the remote staging is designed to handle concurrent requests for the same file. Nextflow will coordinate these requests to make sure that a given file is downloaded once and reused by all tasks that request it, even if they do so at the same time.
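
The general technique of downloading a file once and sharing it among concurrent requesters can be sketched in Python as follows. This illustrates the idea only and is not how Nextflow implements it. DownloadCoordinator, fake_fetch, and the example URL and path are all hypothetical names chosen for the sketch.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor


class DownloadCoordinator:
    """Ensure each URL is fetched once, even when requested concurrently."""

    def __init__(self, fetch):
        self._fetch = fetch        # callable: url -> local path
        self._lock = threading.Lock()
        self._futures = {}         # url -> Future holding the local path

    def get(self, url):
        # The first caller for a URL becomes the owner and performs the fetch.
        with self._lock:
            future = self._futures.get(url)
            owner = future is None
            if owner:
                future = Future()
                self._futures[url] = future
        if owner:
            try:
                future.set_result(self._fetch(url))
            except Exception as exc:
                # Simplification: a failed fetch stays cached and is re-raised
                # for every later caller of the same URL.
                future.set_exception(exc)
        # Every other caller blocks here until the owner has finished.
        return future.result()


calls = []


def fake_fetch(url):
    """Stand-in for a real download; records the call and returns a fake path."""
    calls.append(url)
    time.sleep(0.05)  # simulate transfer time so the requests overlap
    return "/work/stage-demo/" + url.rsplit("/", 1)[-1]


coordinator = DownloadCoordinator(fake_fetch)
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(coordinator.get, ["https://example.com/genome.fa"] * 8))
```

Here eight simulated tasks request the same URL at the same time. Only the first one runs fake_fetch, and the other seven wait on the shared Future and receive the same local path.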
