
File Transfers in jobsub lite

Introduction

For the last couple of years, jobsub users have been staging custom code to worker nodes via tar files. At first, these were stored in scratch dCache pools and copied into jobs from there. The problem with this approach was that dCache often became overwhelmed when many jobs tried to access the same tar file simultaneously. To remedy this, Fermilab set up the resilient dCache pools, where files are replicated 20x, and provided a jobsub interface to seamlessly upload tarballs at submission time and transfer them to jobs at runtime (see http://cd-docdb.fnal.gov/cgi-bin/ShowDocument?docid=6763). Although this solution has worked well to mitigate the earlier issues, it still presents a number of challenges and places strain on the dCache infrastructure.

A new solution was thus proposed: use the CernVM File System (CVMFS) to publish tarballs, make those repositories widely available, and handle cleanup, all within one service. This allows similarly rapid code distribution while taking advantage of the caching and wide availability inherent to CVMFS repositories across the Open Science Grid (OSG).

jobsub_lite natively supports using this Rapid Code Distribution Service (RCDS) as its default method of distributing user tarfiles to jobs.

The quick summary of the file transfer mechanisms is the following:

  • -f will transfer a file to the job, but will NOT unpack it for you
  • --tar-file-name must be used with the dropbox or tardir URIs, and will transfer a tar archive (either user-generated, or jobsub-generated from a given directory with the tardir URI) to the job, and provide access to the untarred files.

Both -f and --tar-file-name can be used multiple times and in combination with each other.

-f

  • If you give -f /x/y/z.tar (with no dropbox:// URI), jobsub_lite will simply transfer the file into the $CONDOR_DIR_INPUT directory in the job at submission time, and will not attempt to unpack it, even though z.tar is a tarball. The user is responsible for unpacking the transferred tarball if needed (see the sketch after this list). In this case, the file /x/y/z.tar MUST be grid-accessible (i.e., able to be copied in via ifdh).
  • If you give -f dropbox:///x/y/z.tar, then z.tar will be transferred into the $CONDOR_DIR_INPUT directory at job runtime, and the user will still need to unpack the tarball.
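In the job, unpacking a tarball transferred with -f might look like the following (a minimal sketch; z.tar matches the example above):

    #!/bin/bash
    # -f delivers z.tar still packed, in the directory given by $CONDOR_DIR_INPUT.
    # Unpack it into the job's working directory before using its contents.
    mkdir -p unpacked
    tar -xf "$CONDOR_DIR_INPUT/z.tar" -C unpacked
    ls unpacked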

--tar-file-name

  • In the user job, when --tar-file-name is used on the jobsub_submit command line, $INPUT_TAR_DIR_LOCAL always contains the unpacked tarball, either directly or through symlinks. Using $INPUT_TAR_DIR_LOCAL is therefore recommended over the other options (see below, and the sketch after this list).
  • To have a tarball unpacked in the job automatically, use the --tar-file-name flag. jobsub_submit --tar-file-name dropbox:///x/y/z.tar unpacks the tar file for you: it transfers z.tar to $INPUT_TAR_FILE in the job, and unpacks the contents of z.tar into the same directory in which $INPUT_TAR_FILE resides.
  • jobsub_lite can also tar up a directory for you, using the tardir URI with --tar-file-name. jobsub_submit --tar-file-name tardir:///path/to/mydir will pack up /path/to/mydir as a tar file, upload it to the dropbox, and unpack it in the job into the same directory in which $INPUT_TAR_FILE resides.
  • In the job, $CONDOR_DIR_INPUT contains a symlink to the unpacked tarball with the correct name.
  • dirname($INPUT_TAR_FILE) always contains the unpacked tarball, either directly or via symlinks.
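For example, a job script can reference the unpacked contents directly (a minimal sketch; setup.sh and myprogram are hypothetical names inside the tarball):

    #!/bin/bash
    # The tarball submitted with --tar-file-name is already unpacked here:
    source "$INPUT_TAR_DIR_LOCAL/setup.sh"   # hypothetical setup script
    "$INPUT_TAR_DIR_LOCAL/myprogram"         # hypothetical executable
    # Equivalently, the directory holding $INPUT_TAR_FILE has the same contents:
    ls "$(dirname "$INPUT_TAR_FILE")"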

Dropbox Mechanisms

How it works

Upon execution of the jobsub_submit command, jobsub_lite creates its own copy of the tarball (or, in the case of the tardir URI, creates a tarball from the given directory) with 0755 permissions on each file. It then calculates the hash of the copied tarball and, depending on the dropbox method (RCDS or PNFS), uses that hash to handle the upload of the tarball.
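Conceptually, the preparation step resembles the following sketch (illustration only, not jobsub_lite's actual code; the hash algorithm is assumed here to be SHA-256):

    # Unpack the user tarball, normalize permissions, and repack it:
    workdir=$(mktemp -d)
    mkdir "$workdir/unpacked"
    tar -xf /local/path/to/mytar.tar -C "$workdir/unpacked"
    chmod -R 0755 "$workdir/unpacked"
    tar -cf "$workdir/repacked.tar" -C "$workdir/unpacked" .
    # The hash of the repacked tarball identifies it in the dropbox:
    hash=$(sha256sum "$workdir/repacked.tar" | awk '{print $1}')
    echo "tarball hash: $hash"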

Rapid Code Distribution Service (RCDS) via CVMFS

The RCDS is jobsub_lite's default mechanism for distributing user tarballs to jobs. When the dropbox or tardir URIs are used (with -f for the former, and --tar-file-name for either URI), and the tarball is not already present in the RCDS repositories, jobsub_lite uploads the tarball to one of the RCDS publish servers. Each server unpacks the tarball into one of the RCDS repositories. By default, jobsub_lite waits for the publish to finish and the tarball to become available before allowing job submission to continue; it will wait up to 10 minutes for each RCDS file to be found before failing the submission. This behavior can be overridden using the --skip-check rcds flag/value, as shown below.
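For example, to submit without waiting for the RCDS publish check (paths hypothetical):

    jobsub_submit -G fermilab --skip-check rcds --tar-file-name dropbox:///local/path/to/mytar.tar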

When a job starts running, it verifies that the tarball is available in the prescribed repository and provides access to the unpacked tarball as explained above, depending on the exact flags used to upload it. If the job cannot find the tarball, it keeps retrying for 20 minutes before failing.

Diving a little deeper into what happens after the initial publish of the tarball: the various CVMFS Stratum 1 machines check the RCDS repositories for updates twice per minute, the Stratum 1 squids check for updates after caching for one minute, and the CVMFS clients across the OSG check for updates after caching for 15 seconds. The combined effect is that the turnaround from tarball upload to availability across CVMFS is much shorter than for a standard CVMFS-published file.
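Putting those numbers together, the worst-case propagation delay after the publish itself completes is roughly 30 s (Stratum 1 check interval) + 60 s (squid cache lifetime) + 15 s (client cache lifetime), i.e. under two minutes before a freshly published tarball is visible to clients grid-wide.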

All tarball hash directories are removed from the RCDS repositories 30 days after the last time they were uploaded or checked by either jobsub_lite or the running job.

PNFS

As an alternative, users can still upload tarballs to the resilient dCache pools, though this is not recommended. Passing the --use-pnfs-dropbox flag to the jobsub_submit command makes jobsub_lite upload ALL tar files given in that command to resilient dCache rather than RCDS, as in the example below.
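For example (paths hypothetical):

    jobsub_submit -G fermilab --use-pnfs-dropbox --tar-file-name dropbox:///local/path/to/mytar.tar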

As with RCDS, a file in resilient dCache is deleted 30 days after the last time a job advertised that it needed the file.

An extremely contrived example

Here is a contrived example that uses ALL of the flags above, and explains where each file can be found in the job. Suppose we submit a job using the following command:

jobsub_submit -G fermilab -f /grid/accessible/path/to/myfile1 -f dropbox:///local/path/to/myfile2  --tar-file-name dropbox:///local/path/to/mytar.tar --tar-file-name dropbox:///local/path/to/mytar2.tar --tar-file-name tardir:///local/path/to/mydir1 --tar-file-name tardir:///local/path/to/mydir2

The uploaded files/directories will be located as follows:

  • myfile1 will be copied into the job, and will be available in the directory given by $CONDOR_DIR_INPUT.
  • myfile2 will be uploaded at submission time and copied into the job at runtime. It will also be available in $CONDOR_DIR_INPUT.
  • mytar.tar will be untarred in the dropbox, and the contents of the tarball will be available in the job under the directory given by $INPUT_TAR_DIR_LOCAL.
  • mytar2.tar will be untarred in the dropbox, and the contents of the tarball will be available in the job under the directory given by $INPUT_TAR_DIR_LOCAL_1.
  • mydir1 will be tarred up at submit time, uploaded to the dropbox and untarred, and the contents of the directory will be available in the job under the directory given by $INPUT_TAR_DIR_LOCAL_2.
  • mydir2 will be tarred up at submit time, uploaded to the dropbox and untarred, and the contents of the directory will be available in the job under the directory given by $INPUT_TAR_DIR_LOCAL_3.
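A job script for this submission could then find everything as follows (a minimal sketch; the contents of the tarballs and directories are not listed here, so the ls commands simply show what arrived):

    #!/bin/bash
    # Files given with -f are copied into $CONDOR_DIR_INPUT as-is:
    ls "$CONDOR_DIR_INPUT/myfile1" "$CONDOR_DIR_INPUT/myfile2"
    # Each --tar-file-name argument gets its own unpacked directory:
    ls "$INPUT_TAR_DIR_LOCAL"      # contents of mytar.tar
    ls "$INPUT_TAR_DIR_LOCAL_1"    # contents of mytar2.tar
    ls "$INPUT_TAR_DIR_LOCAL_2"    # contents of mydir1
    ls "$INPUT_TAR_DIR_LOCAL_3"    # contents of mydir2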