File Transfers in jobsub lite
For the last couple of years, jobsub users have been staging custom code to worker nodes via tar files. At first, these were stored on scratch dCache pools and copied into jobs from there. The issue with this was that dCache often became overwhelmed when many jobs tried to access the same tar file simultaneously. To remedy this, Fermilab set up the resilient dCache pools, where files are replicated 20x, and provided a jobsub interface to seamlessly upload tarballs at submission time and transfer them to jobs at runtime (see http://cd-docdb.fnal.gov/cgi-bin/ShowDocument?docid=6763). Although this solution has worked well to mitigate the earlier issues, it still presents a number of challenges and puts strain on the dCache infrastructure.
A new solution was thus proposed to use CernVM File System (CVMFS) to publish tarballs, make these repositories widely available, and handle cleanup all within one service. This would allow for similarly rapid code distribution along with taking advantage of the caching and wide availability inherent to CVMFS repositories across the Open Science Grid (OSG).
jobsub_lite natively supports using this Rapid Code Distribution Service (RCDS) as its default method of distributing user tarfiles to jobs.
The quick summary of the file transfer mechanisms is the following:

- `-f` will transfer a file to the job, but will NOT unpack it for you.
- `--tar-file-name` must be used with the dropbox or tardir URIs, and will transfer a tar archive (either user-generated, or jobsub-generated from a given directory with the tardir URI) to the job, and provide access to the untarred files.
- Both `-f` and `--tar-file-name` can be used multiple times and in combination with each other.
In more detail:

- If you give `-f /x/y/z.tar` (with no `dropbox://` URI), jobsub_lite will simply transfer the file to the job at submission time, placing it in the `$CONDOR_DIR_INPUT` directory in the job, and will not attempt to unpack it, even though `z.tar` is a tarball. The user is responsible for unpacking the transferred tarball if needed (see the sketch after this list). In this case, the file `/x/y/z.tar` MUST be grid-accessible (able to be copied in via ifdh).
- If you give `-f dropbox:///x/y/z.tar`, then `z.tar` will be transferred to the job at job runtime into the `$CONDOR_DIR_INPUT` directory, and the user will still need to unpack the tarball.
- In the user job, when `--tar-file-name` is used on the jobsub_submit command line, `$INPUT_TAR_DIR_LOCAL` always contains the unpacked tarball, either directly or through symlinks. In this case, using `$INPUT_TAR_DIR_LOCAL` is recommended over the other options (see below).
- To have a tarball unpacked in the job automatically, use the `--tar-file-name` flag: `jobsub_submit --tar-file-name dropbox:///x/y/z.tar` unpacks the tarfile for you. This will transfer `z.tar` to `$INPUT_TAR_FILE` in the job, and unpack the contents of `z.tar` into the same directory in which `$INPUT_TAR_FILE` resides.
- jobsub_lite can also tar up a directory for you, using the `tardir` URI with `--tar-file-name`: `jobsub_submit --tar-file-name tardir:///path/to/mydir` will pack up `/path/to/mydir` as a tarfile, upload it to the dropbox, and unpack it in the job into the same directory in which `$INPUT_TAR_FILE` resides.
- In the job, `$CONDOR_DIR_INPUT` has a symlink to the unpacked tarball with the correct name.
- `dirname($INPUT_TAR_FILE)` always contains the unpacked tarball, either directly or via symlinks.
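To make these cases concrete, here is a minimal job-script sketch. It assumes one tarball sent with plain `-f` and another sent with `--tar-file-name`; the file names and the `setup.sh` inside the tarball are hypothetical.

```bash
#!/bin/bash
# A tarball sent with plain -f lands in $CONDOR_DIR_INPUT and is NOT unpacked;
# the job script must extract it itself.
tar -xf "$CONDOR_DIR_INPUT/z.tar" -C "$_CONDOR_SCRATCH_DIR"

# A tarball sent with --tar-file-name is already unpacked:
# $INPUT_TAR_DIR_LOCAL points at the extracted contents.
ls "$INPUT_TAR_DIR_LOCAL"
source "$INPUT_TAR_DIR_LOCAL/setup.sh"   # hypothetical setup script shipped in the tarball
```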
When the jobsub_submit command is executed, jobsub_lite will create its own copy of the tarball (or, in the case of the tardir URI, will create a tarball from the given directory) with 0755 permissions on each file. It then calculates the hash of the copied tarball. Using this hash, it handles the upload of the tarball according to the dropbox method in use (RCDS or PNFS), as sketched below.
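Conceptually, that preparation step amounts to something like the following sketch. This is an illustration rather than jobsub_lite's actual code; in particular, the use of `sha256sum` here is an assumption about the hash.

```bash
# Illustrative only: normalize permissions, repack, and hash a tarball.
workdir=$(mktemp -d)
tar -xf /local/path/to/mytar.tar -C "$workdir"
chmod -R 0755 "$workdir"                        # 0755 on each file, as jobsub_lite does
tar -cf /tmp/mytar.repacked.tar -C "$workdir" .
sha256sum /tmp/mytar.repacked.tar               # the hash identifies the tarball in the dropbox
```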
RCDS is the default mechanism for jobsub_lite to distribute user tarballs to jobs. When the dropbox or tardir URIs are used (with `-f` for the former, and `--tar-file-name` for either URI), if the tarball is not already present in the RCDS repositories, jobsub_lite will upload it to one of the RCDS publish servers. Each server unpacks the tarball into one of the RCDS repositories. By default, jobsub_lite waits for the publish to finish and the tarball to be available before allowing job submission to continue, waiting up to 10 minutes for each RCDS file to be found before failing the submission. This behavior can be overridden with the `--skip-check rcds` flag, as in the example below.
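For example, to submit without waiting for the RCDS publish to complete (the job script name here is hypothetical):

```bash
jobsub_submit -G fermilab --skip-check rcds \
    --tar-file-name dropbox:///local/path/to/mytar.tar \
    file:///local/path/to/myscript.sh
```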
When a job starts to run, it will verify that the tarball is available at the prescribed repository, and provide access to the unpacked tarball as explained above, depending on the exact flags used to upload it. If the job cannot find the tarball, it will keep retrying for 20 minutes before failing, along the lines of the sketch below.
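That job-side wait behaves like the following polling loop (a sketch only; the CVMFS path is a made-up example of where an unpacked tarball might land):

```bash
# Sketch: poll for the unpacked tarball in CVMFS, giving up after 20 minutes.
tarball_dir=/cvmfs/fifeuser1.opensciencegrid.org/sw/hypothetical-hash   # assumed path
deadline=$(( $(date +%s) + 20 * 60 ))
until [ -d "$tarball_dir" ]; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
        echo "Tarball never appeared in CVMFS" >&2
        exit 1
    fi
    sleep 30
done
```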
Diving a little deeper into what happens after the initial publish of the tarball: the various CVMFS Stratum 1 machines check the RCDS repositories for updates twice per minute, the Stratum 1 squids check for updates after caching for one minute, and the CVMFS clients across the OSG check for updates after caching for 15 seconds. The combined worst-case delay is therefore roughly 30 seconds + 60 seconds + 15 seconds, or under two minutes, so the turnaround from tarball upload to availability across CVMFS is much shorter than for a standard CVMFS-published file.
All tarball hash directories are removed from the RCDS repositories 30 days after the last time they were uploaded or checked by either jobsub_lite or the running job.
As an alternative, users can still use the resilient dCache pools to upload tarballs, though this is not recommended. If the `--use-pnfs-dropbox` flag is passed to the `jobsub_submit` command, jobsub_lite will upload ALL tarfiles given in that command to resilient dCache rather than to RCDS, as in the example below.
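For example (the job script name is again hypothetical):

```bash
jobsub_submit -G fermilab --use-pnfs-dropbox \
    --tar-file-name dropbox:///local/path/to/mytar.tar \
    file:///local/path/to/myscript.sh
```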
As with RCDS, a file in resilient dCache is deleted 30 days after the last time a job advertised that it needed the file.
Here is a contrived example that uses ALL of the flags above, and explains where each file can be found in the job. Suppose we submit a job using the following command:
```bash
jobsub_submit -G fermilab \
    -f /grid/accessible/path/to/myfile1 \
    -f dropbox:///local/path/to/myfile2 \
    --tar-file-name dropbox:///local/path/to/mytar.tar \
    --tar-file-name dropbox:///local/path/to/mytar2.tar \
    --tar-file-name tardir:///local/path/to/mydir1 \
    --tar-file-name tardir:///local/path/to/mydir2
```
The uploaded files/directories will be located as follows (see the access sketch after this list):

- `myfile1` will be copied into the job, and will be available in the directory given by `$CONDOR_DIR_INPUT`.
- `myfile2` will be uploaded at submission time and copied into the job at runtime. It will also be available in `$CONDOR_DIR_INPUT`.
- `mytar.tar` will be untarred in the dropbox, and the contents of the tarball will be available in the job under the directory given by `$INPUT_TAR_DIR_LOCAL`.
- `mytar2.tar` will be untarred in the dropbox, and the contents of the tarball will be available in the job under the directory given by `$INPUT_TAR_DIR_LOCAL_1`.
- `mydir1` will be tarred up at submit time, uploaded to the dropbox and untarred, and the contents of the directory will be available in the job under the directory given by `$INPUT_TAR_DIR_LOCAL_2`.
- `mydir2` will be tarred up at submit time, uploaded to the dropbox and untarred, and the contents of the directory will be available in the job under the directory given by `$INPUT_TAR_DIR_LOCAL_3`.
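Inside the job, a script could then reach each of these locations as follows (a sketch using the file names from the example above):

```bash
#!/bin/bash
ls "$CONDOR_DIR_INPUT/myfile1"   # plain -f transfer, copied in at submission
ls "$CONDOR_DIR_INPUT/myfile2"   # dropbox:// -f transfer, copied in at runtime
ls "$INPUT_TAR_DIR_LOCAL"        # unpacked contents of mytar.tar
ls "$INPUT_TAR_DIR_LOCAL_1"      # unpacked contents of mytar2.tar
ls "$INPUT_TAR_DIR_LOCAL_2"      # contents of mydir1
ls "$INPUT_TAR_DIR_LOCAL_3"      # contents of mydir2
```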