vine: most temp files do not reach their replication threshold #3996

Open
JinZhou5042 opened this issue Nov 27, 2024 · 1 comment

Comments

@JinZhou5042
Member

JinZhou5042 commented Nov 27, 2024

VINE_WORKER_SOURCE_MAX_TRANSFERS defines the maximum number of ports on a worker that can be used to transfer temp files to other workers. The default value is 3.
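
To make the meaning of a per-worker source limit concrete, here is a minimal sketch of such a gate. It is not TaskVine's implementation: the class name SourceTransferLimiter and its methods are hypothetical, and the only value taken from this issue is the default limit of 3.

```python
from collections import defaultdict


class SourceTransferLimiter:
    """Toy model of a per-worker cap on outgoing transfers (hypothetical, not TaskVine code)."""

    def __init__(self, max_transfers=3):
        # 3 mirrors the default mentioned in this issue.
        self.max_transfers = max_transfers
        # Number of outgoing transfers currently in flight, keyed by source worker.
        self.in_flight = defaultdict(int)

    def try_start(self, source):
        """Reserve one outgoing slot on `source`; return False if it is at the cap."""
        if self.in_flight[source] >= self.max_transfers:
            return False
        self.in_flight[source] += 1
        return True

    def finish(self, source):
        """Release one slot on `source` once a transfer is known to be complete."""
        if self.in_flight[source] > 0:
            self.in_flight[source] -= 1


limiter = SourceTransferLimiter(max_transfers=3)
# The fourth concurrent transfer from the same source is refused.
print([limiter.try_start("worker-1") for _ in range(4)])
```

In this model a refused transfer can only proceed after finish() is called for the same source, so any delay or omission in releasing slots directly reduces how many replicas can be scheduled.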

If each temp file is configured to have 5 replicas, this is what happens:

  • Initially, when a file is created, all workers have free transfer capacity, so the file can immediately be replicated to 4 other workers.
  • As more files are created, most workers become busy transferring and most files remain unreplicated, because the number of tasks is substantially larger than the number of workers.
  • The number of ports needed is very large. Say we have n workers, each worker is allowed to open k ports for replication, and we have m tasks. The total number of ports required over the lifetime of the workflow is (n-1)*k*m.
  • There is a delay between when a file is successfully replicated and when the cache update message arrives at the manager, which makes the replication process inefficient.
  • It seems that when a file is successfully replicated, the ports on that worker are not freed? (Maybe I just didn't find the code.)
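
The port estimate in the list above can be worked through with concrete numbers. The sketch below evaluates the (n-1)*k*m estimate exactly as stated and compares it with the number of ports that can be open at one time, n*k. The values n=100 and m=10000 are made up for illustration, and the helper names are hypothetical.

```python
def lifetime_port_demand(n_workers, ports_per_worker, n_tasks):
    """Total ports needed over the workflow lifetime, using the estimate (n-1)*k*m."""
    return (n_workers - 1) * ports_per_worker * n_tasks


def concurrent_port_supply(n_workers, ports_per_worker):
    """Ports that can be open at the same moment across all workers: n*k."""
    return n_workers * ports_per_worker


# Illustrative values only: 100 workers, the default k=3, and 10,000 tasks.
n, k, m = 100, 3, 10_000
demand = lifetime_port_demand(n, k, m)
supply = concurrent_port_supply(n, k)
print(f"lifetime demand: {demand}, concurrent supply: {supply}")
# lifetime demand: 2970000, concurrent supply: 300
```

Under these assumed numbers, each port would have to be reused about 9,900 times over the run, so anything that slows down or prevents the release of a port has a large effect.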

I posted the following figure two weeks ago and thought the cause was that the manager prioritizes task dispatching and output retrieval, so most files don't reach their replication threshold. But after I prioritized the replication step, I got almost the same distribution. Things are more complex than I thought.

[Figure: QQ_1732742763541]

@JinZhou5042 changed the title from "vine: too strict worker p2p transfer limitation" to "vine: most temp files do not reach their replication threshold" on Nov 27, 2024
@JinZhou5042
Member Author

JinZhou5042 commented Nov 27, 2024

I tried extending the execution time of each task from 1s to 40s to give the manager some idle time to replicate, but the replication pattern didn't change. So I suppose the cause is not that the tasks run very fast and the workflow ends before more replications can happen.

I did another test with q->worker_source_max_transfers = 10000, and every file was able to reach 5 replicas.
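
One way to reason about this result is a toy simulation of the unconfirmed hypothesis that transfer slots are not released after a replication completes. The sketch below is not TaskVine code and does not prove the cause. The function simulate and all of its parameters are hypothetical, transfers are treated as instantaneous, and the worker and file counts are made up.

```python
def simulate(n_workers, n_files, target, limit, release_slots=True):
    """Count files that reach `target` replicas in a toy model (hypothetical)."""
    outgoing = [0] * n_workers  # outgoing slots currently held, per worker
    # File f is created on worker f % n_workers.
    holders = [{f % n_workers} for f in range(n_files)]

    for f in range(n_files):
        for dest in range(n_workers):
            if len(holders[f]) >= target:
                break
            if dest in holders[f]:
                continue
            # Pick any current holder that still has a free outgoing slot.
            source = next(
                (w for w in sorted(holders[f]) if outgoing[w] < limit), None
            )
            if source is None:
                break  # every holder is at its limit, so the file is stuck
            outgoing[source] += 1
            holders[f].add(dest)
            if release_slots:
                outgoing[source] -= 1  # slot is returned when the transfer ends

    return sum(len(h) >= target for h in holders)


# 10 workers, 100 temp files, 5 replicas each (illustrative numbers).
print(simulate(10, 100, 5, limit=3, release_slots=True))
print(simulate(10, 100, 5, limit=3, release_slots=False))
print(simulate(10, 100, 5, limit=10000, release_slots=False))
```

In this model, if slots are never released, a limit of 3 gives 10 workers only 30 transfers in total, so at most 7 files can gain their 4 extra replicas, while a limit of 10000 is never exhausted and every file reaches 5 replicas. That matches the pattern reported here, but a slow release (for example, waiting on the cache update message) could produce a similar effect, so the model only narrows down where to look.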
