
Globus transfer optimization #214

Merged
merged 4 commits on Nov 8, 2022

Conversation

lukaszlacinski
Contributor

This PR addresses #171.

@lukaszlacinski lukaszlacinski changed the title Lukaszlacinski/globus optimization Globus transfer optimization Aug 1, 2022
@forsyth2 forsyth2 added the priority: high High priority task label Aug 29, 2022
@forsyth2 forsyth2 added the Globus Globus label Sep 19, 2022
@forsyth2
Collaborator

forsyth2 commented Oct 12, 2022

@lukaszlacinski @golaz I reviewed the code (including the first commit, also found in #185) as well as I was able to. Overall, the code looks OK to me. I do have a few questions, though:

  • globus_activate is used in create and update. Is this not needed for extract and ls? Wouldn't we always want to make sure the endpoints are activated?
  • globus_finalize is used in create and ls. Is this not needed for extract and update?

From #171:

With the --non-blocking option, zstash submits a Globus transfer and immediately creates a subsequent tarball. Zstash does not wait until the transfer completes to start creating the subsequent tarball.

I see the non-blocking option only exists for create. To test, I ran zstash create --hpss=globus://nersc/home/f/forsyth/zstash214/v2.LR.historical_0201 --non-blocking v2.LR.historical_0201.

On machines where zstash tarballs are created faster than they are transferred to a remote endpoint, zstash will create multiple Globus transfers that are put in a queue by the Globus service. Instead, zstash should submit each transfer with as many zstash tarballs as have been created at that moment.

On machines where it takes more time to create a tarball than transfer it, each Globus transfer will have one file.

On machines where it takes less time to create a tarball than transfer it, the first transfer will have one file, but the number of tarballs in subsequent transfers will grow, dynamically finding the optimal number of tarballs per transfer.

My test appears to be in the first case (more time to create a tarball than transfer it). I had 48 tars and 48 Globus transfers.

I also want to confirm that we should not expect simultaneous Globus transfers. I didn't check the time stamps of all 48 Globus transfers, but it appears there was a maximum of one transfer occurring at a time. (I suppose simultaneous transfers would only be useful in the second case anyway, when we're creating tars faster than we can transfer them.)
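
For intuition, here is a toy simulation of the batching behavior quoted above. This is not zstash code: it assumes fixed per-tarball creation and transfer times, one transfer in flight at a time, and that each new transfer takes every tarball queued so far.

```python
# Toy model of the batching described in #171 (illustrative only, not zstash code).

def simulate(n_tars, create_time, transfer_time_per_tar):
    batches = []
    queued = 0
    transfer_busy_until = 0.0
    now = 0.0
    for _ in range(n_tars):
        now += create_time                # producer finishes one more tarball
        queued += 1
        if now >= transfer_busy_until:    # no ACTIVE transfer: submit everything queued
            batches.append(queued)
            transfer_busy_until = now + queued * transfer_time_per_tar
            queued = 0
    if queued:                            # finalize: flush whatever is left
        batches.append(queued)
    return batches

# Creating a tarball takes longer than transferring it: one file per transfer,
# matching the 48-tar / 48-transfer observation above.
print(simulate(48, create_time=10, transfer_time_per_tar=5))   # [1]*48
# Transferring takes longer than creating: batch sizes grow dynamically.
print(simulate(48, create_time=5, transfer_time_per_tar=10))   # [1, 2, 4, 8, 16, 17]
```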

@lukaszlacinski
Contributor Author

  • globus_activate is used in create and update. Is this not needed for extract and ls? Wouldn't we always want to make sure the endpoints are activated?

Commands that get tarballs from the storage effectively check whether the Globus endpoints are activated as soon as they submit their first transfer, before any time-consuming local work is done.
Commands that put tarballs to the storage first create a tarball, which can be a time-consuming step, and only then transfer it to the storage. If the Globus endpoints are not activated, the command would fail only after the first tarball has already been created. So only these commands need to check that the Globus endpoints are activated before the first tarball is created.
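
To make that ordering concrete, here is a minimal, self-contained sketch. It is not zstash's actual code: the function bodies are placeholder prints, and the file names and URL are made up. It only illustrates why the activation check has to come before the first, potentially slow, tarball creation in a create/update-style command.

```python
# Illustrative stand-ins for the globus_activate / hpss_put / globus_finalize
# calls discussed in this thread; not the real implementations.

def globus_activate(hpss_url):
    print(f"verifying endpoint activation for {hpss_url}")  # fail fast here

def make_tarball(name):
    print(f"creating {name} (potentially hours of work)")
    return name

def hpss_put(hpss_url, tarball):
    print(f"queueing {tarball} for transfer to {hpss_url}")

def globus_finalize():
    print("submitting whatever is still queued")

def create_or_update(batches, hpss_url):
    globus_activate(hpss_url)  # before any tarball is created, not after
    for name in batches:
        hpss_put(hpss_url, make_tarball(name))
    globus_finalize()

create_or_update(["000000.tar", "000001.tar"], "globus://nersc/~/zstash_test")
```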

  • globus_finalize is used in create and ls. Is this not needed for extract and update?

My mistake (fixed). I removed globus_finalize from ls and added it to update.

The Globus transfer optimization in this PR relies on the fact that two processes, 1) creating files (tarballs, database) and 2) transferring those files, can run in parallel as a producer-consumer pair. The create and update commands create files and add them to a queue in the globus module via hpss_put. If there is no ACTIVE transfer, the globus module submits a transfer of all files currently in the queue. globus_finalize is the way to tell the globus module that the producer has finished creating files and the consumer should not expect any more files to transfer to the storage.
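
A rough, self-contained sketch of that queueing logic is below. It reuses the names hpss_put and globus_finalize from this thread, but everything else (the threading.Event standing in for an ACTIVE Globus task, the Timer that "completes" a transfer, the submit_transfer stub) is made up for illustration; the real zstash globus module talks to the Globus Transfer service and handles concurrency differently.

```python
# Producer-consumer sketch: producer calls hpss_put after each file is written;
# the "consumer" submits a transfer of everything queued whenever no transfer
# is active. Illustrative only; no real Globus calls are made.

import threading
from queue import Queue, Empty

file_queue: Queue = Queue()
active_transfer = threading.Event()   # set while a toy "transfer" is in flight

def _drain_queue():
    files = []
    while True:
        try:
            files.append(file_queue.get_nowait())
        except Empty:
            return files

def submit_transfer(files):
    # Stand-in for submitting one Globus transfer task containing `files`.
    print(f"submitting transfer of {len(files)} file(s): {files}")
    active_transfer.set()
    threading.Timer(1.0, active_transfer.clear).start()  # pretend it finishes later

def hpss_put(path):
    # Producer side: called after each tarball (or the database) is written.
    file_queue.put(path)
    if not active_transfer.is_set():
        submit_transfer(_drain_queue())   # no ACTIVE transfer: take everything queued

def globus_finalize():
    # Producer is done; transfer whatever is left without waiting for the
    # currently ACTIVE transfer to complete.
    leftovers = _drain_queue()
    if leftovers:
        submit_transfer(leftovers)
```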

In ls and extract, transferring files and extracting them cannot be overlapped in this way: zstash has to get the files first and only then extract them.

I see the non-blocking option only exists for create. To test, I ran zstash create --hpss=globus://nersc/home/f/forsyth/zstash214/v2.LR.historical_0201 --non-blocking v2.LR.historical_0201.

My mistake (fixed). I added it to update as well.

I also want to confirm that we should not expect simultaneous Globus transfers. I didn't check the time stamps of all 48 Globus transfers, but it appears there was a maximum of one transfer occurring at a time. (I suppose simultaneous transfers would only be useful in the second case anyway, when we're creating tars faster than we can transfer them.)

That's correct, with one exception. The last two batches of files (tarballs and the database) in the create and update commands may run as two parallel Globus transfers. When globus_finalize is called, it does not check whether there is an ACTIVE transfer and wait for it to complete; it immediately submits the last transfer for whatever is left in the queue.
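
For illustration, driving the queue sketch above end-to-end shows exactly this overlap: the batch submitted by globus_finalize goes out while the first toy transfer is still flagged as active (the file names here are hypothetical).

```python
# Using the sketch above: the first put submits a one-file transfer; the next
# two files are queued while it is "active", and globus_finalize submits them
# immediately, in parallel with the still-active first transfer.
for path in ["000000.tar", "000001.tar", "index.db"]:
    hpss_put(path)
globus_finalize()
```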

@forsyth2 forsyth2 (Collaborator) left a comment

Thanks for the updates, @lukaszlacinski. The unit tests pass with these changes. So, I'll merge this. Also, I will just close #185 since that commit is included in this one.
