This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

Async Population(multiple workers) of DataCollection #361

Closed
wants to merge 14 commits into from

Conversation

dadams39
Contributor

Removes the cap on items in the DataCollection. This removes the deadlocking condition for now.

@dadams39 dadams39 linked an issue Jul 19, 2022 that may be closed by this pull request
3 tasks
@dadams39 dadams39 requested a review from ashmrtn July 19, 2022 03:12
@dadams39 dadams39 requested review from ryanfkeepers and removed request for ashmrtn July 19, 2022 03:12
@dadams39 dadams39 requested review from ashmrtn and vkamra July 19, 2022 03:12
@dadams39 dadams39 added the bug Something isn't working label Jul 19, 2022
@dadams39 dadams39 self-assigned this Jul 19, 2022
@dadams39 dadams39 temporarily deployed to Testing July 19, 2022 03:29 Inactive
@dadams39 dadams39 temporarily deployed to Testing July 19, 2022 03:34 Inactive
dadams39 added 2 commits July 19, 2022 10:21
Added sort and sort test for DataCollections
@dadams39 dadams39 marked this pull request as draft July 19, 2022 14:51
@ashmrtn
Contributor

ashmrtn commented Jul 19, 2022

wait, I'm confused, didn't we agree this would make things more likely to deadlock if there's lots of folders?

@dadams39
Contributor Author

dadams39 commented Jul 19, 2022

@ashmrtn There are two components, and as long as one of them is making progress we will not deadlock. On the left, the current implementation's GC blocks on the first directory and does not move; KW's threads are arbitrarily assigned, not necessarily to the blocked directory, and the instance times out after ten minutes. On the right, the GC moves on to the next directory after it serializes all the messages in the first directory. I am in no way saying this will stop all the failures for Issue #356; the current implementation can be extremely slow if KW sits idle waiting on the smallest directories (see Issue #362). It will stop the deadlock, though.
[screenshot: Screen Shot 2022-07-19 at 2 33 11 PM, diagram of the two flows described above]

But, of course, you are correct, because that is not how Go channels work. We don't want to have to worry about the size of the channel; however, we do need to pick an arbitrary buffer size that we should reasonably never exceed in testing. So we are changing the title to "increase the limit" and working from there.
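The buffering behavior under discussion can be shown with a toy Go sketch (illustrative only, not project code): as long as a consumer is draining the channel, a producer gets every item through regardless of the buffer size; without a consumer, any send beyond the buffer blocks forever, which is the deadlock described above.

```go
package main

import "fmt"

// produceAndConsume sends n items through a channel with the given buffer
// size and returns how many items a single consumer drained. With the
// consumer running, all n items get through even when n exceeds the
// buffer; remove the consumer and the (buffer+1)-th send blocks forever.
func produceAndConsume(n, buffer int) int {
	items := make(chan int, buffer)
	counted := make(chan int)

	// consumer: drains the channel, standing in for KW uploading a folder
	go func() {
		count := 0
		for range items {
			count++
		}
		counted <- count
	}()

	// producer: sends more items than the buffer holds
	for i := 0; i < n; i++ {
		items <- i
	}
	close(items)
	return <-counted
}

func main() {
	fmt.Println(produceAndConsume(5, 2)) // all 5 items drain despite a buffer of 2
}
```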

@dadams39 dadams39 changed the title Issue #360: Remove limit on the data collection. Issue #360: Increase limit on the data collection. Jul 19, 2022
@ashmrtn
Contributor

ashmrtn commented Jul 19, 2022

ok, that makes more sense. I was confused because the patch was removing the channel buffer, which meant it would block the first time it pulled data for a folder outside the set of folders kopia was currently uploading.

Fixes Issue #364, Fixes deadlock from #360
@dadams39 dadams39 temporarily deployed to Testing July 20, 2022 00:19 Inactive
@dadams39 dadams39 marked this pull request as ready for review July 20, 2022 00:20
@dadams39 dadams39 requested review from ryanfkeepers and ashmrtn July 20, 2022 00:20
@dadams39
Contributor Author

@ryanfkeepers, added a change that exposes a function MergeStatus(num int), which waits on the status of everything in the pipe and compiles all the statuses into one.
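A minimal sketch of the idea behind MergeStatus(num int), assuming workers report a status on a shared channel; the status type and field names here are hypothetical, not the connector's actual API:

```go
package main

import "fmt"

// status is a hypothetical stand-in for the connector's status type;
// the fields are illustrative only.
type status struct {
	objects int
	errs    int
}

// mergeStatus waits for num status reports on the channel and folds them
// into a single combined summary, as MergeStatus(num int) is described
// doing for everything in the pipe.
func mergeStatus(num int, in <-chan status) status {
	var total status
	for i := 0; i < num; i++ {
		s := <-in
		total.objects += s.objects
		total.errs += s.errs
	}
	return total
}

func main() {
	ch := make(chan status, 3)
	// three workers report their results concurrently
	for i := 1; i <= 3; i++ {
		go func(n int) { ch <- status{objects: n} }(i)
	}
	fmt.Println(mergeStatus(3, ch)) // objects total: 6
}
```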

@dadams39 dadams39 temporarily deployed to Testing July 20, 2022 00:23 Inactive
@dadams39 dadams39 temporarily deployed to Testing July 20, 2022 13:01 Inactive
src/internal/connector/exchange_data_collection.go (outdated)
src/internal/connector/graph_connector.go (outdated)
src/internal/connector/graph_connector.go (outdated)
src/internal/connector/graph_connector.go
src/internal/connector/graph_connector.go (outdated)
src/internal/connector/graph_connector.go (outdated)
src/internal/connector/support/status.go
src/internal/connector/graph_connector.go
src/internal/connector/graph_connector.go

for aFolder := range tasklist {
// async call to populate
collections, err := gc.launchProcesses(ctx, tasklist, user)
Contributor

as launchProcesses may do synchronous work if it reaches the limit on workers, wouldn't this cause it to block buffering data until launchProcesses returns?

Contributor Author

With a large BFS, the worker processes are all used up and the main thread would take the next folder itself. Ideally, the main would return, and more workers would have finished in the meantime. I think I will add a sleep call for the main instead, because it would be unfortunate if the main took a folder and was held up by the number of items going into the channel.
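The sleep-instead-of-working-inline idea can be sketched like this (all names are illustrative, not the PR's actual code): the dispatcher backs off briefly when every worker slot is busy, rather than populating a folder on the main goroutine, where a large folder could stall dispatch.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// dispatch hands folders to a bounded set of workers. When all worker
// slots are busy it sleeps briefly and retries, instead of processing a
// folder inline on the main goroutine.
func dispatch(folders []string, maxWorkers int) []string {
	slots := make(chan struct{}, maxWorkers) // semaphore bounding workers
	var mu sync.Mutex
	var done []string
	var wg sync.WaitGroup

	for _, f := range folders {
		for {
			select {
			case slots <- struct{}{}: // acquired a free worker slot
			default:
				time.Sleep(10 * time.Millisecond) // back off, don't work inline
				continue
			}
			break
		}
		wg.Add(1)
		go func(folder string) {
			defer wg.Done()
			defer func() { <-slots }() // release the slot
			mu.Lock()
			done = append(done, folder) // stand-in for populating the folder
			mu.Unlock()
		}(f)
	}
	wg.Wait()
	return done
}

func main() {
	fmt.Println(len(dispatch([]string{"inbox", "sent", "archive"}, 2)))
}
```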

Contributor Author

In regard to the sorting question, @ashmrtn, you are 110% correct. I left the sorting in for the requirement involving fault tolerance. The key point of the proposal, as I read it, is that a lot of these mechanisms will be thrown away later. In the interim, however, it is an interesting look at some of the capabilities and workflows we can create with Go.

Contributor

re: launchProcesses, I think eventually we can solve the blocking requirement and streamline the code by doing a couple of things:

  1. decouple getting the data collections to return from getting the data in said collections
  2. if we don't switch to a solution where KW/the DataCollection launch getting items, we may want to use a worker pool to fetch the items

1/ is already mostly implemented, thanks to how the functions are broken down. The major change will be what is actually executed in the background. For example,

  collections := serializeMessages(...)

  go func() {
    // loop through task list and either call populateItems(...) or
    // create another goroutine to run populateItems(...)
  }()

  return collections

will get the collections synchronously, return them, and then fetch all the data in the background. The code that is currently in the anonymous function could be pulled into a separate function if desired (the above is just a quick example). Dispatching folder -> goroutine, with a bound on the number of goroutines running at any one time, can then be done in launchProcesses.

For 2/, the worker pool will ensure good parallelism even if we have situations where the workers got folders with very few items while the dispatching goroutine ended up fetching a folder with many items (thus keeping it from dispatching more work)
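Putting 1/ and 2/ together, a runnable sketch might look like the following (the collection type and function bodies are illustrative assumptions, not the repository's actual code): collections are built and returned synchronously, while a fixed pool of workers fetches each collection's items in the background.

```go
package main

import (
	"fmt"
	"sync"
)

// collection is a hypothetical stand-in for a DataCollection.
type collection struct {
	folder string
	items  int
}

// launchProcesses returns the collections immediately (step 1) and starts
// a bounded worker pool to populate their items in the background (step 2).
// The returned WaitGroup lets the caller wait for population to finish.
func launchProcesses(folders []string, workers int) ([]*collection, *sync.WaitGroup) {
	// step 1: build collections up front so the caller gets them before
	// any item data is fetched
	cols := make([]*collection, len(folders))
	for i, f := range folders {
		cols[i] = &collection{folder: f}
	}

	// step 2: workers pull collections off a channel, so one huge folder
	// cannot keep the dispatcher from handing out more work
	work := make(chan *collection)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for c := range work {
				c.items = len(c.folder) // stand-in for populateItems(...)
			}
		}()
	}
	go func() {
		for _, c := range cols {
			work <- c
		}
		close(work)
	}()
	return cols, &wg
}

func main() {
	cols, wg := launchProcesses([]string{"inbox", "sent"}, 2)
	wg.Wait()
	for _, c := range cols {
		fmt.Println(c.folder, c.items)
	}
}
```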

Contributor Author

Added a sleep command for the interim

@dadams39 dadams39 changed the title Issue #360: Increase limit on the data collection. Async Population(multiple workers) of DataCollection Jul 20, 2022
dadams39 added 2 commits July 20, 2022 14:39
Added sleep call, renamed context -> ctx, and removed debug statements that used fmt.
@dadams39 dadams39 temporarily deployed to Testing July 20, 2022 18:40 Inactive
@dadams39 dadams39 temporarily deployed to Testing July 20, 2022 18:58 Inactive
Contributor

@vkamra vkamra left a comment


Adding a block: let's not merge until we agree on the final design.

@dadams39 dadams39 marked this pull request as draft July 20, 2022 20:47
@dadams39 dadams39 closed this Jul 21, 2022
@dadams39
Contributor Author

This PR is being closed in favor of a PR that fixes the deadlock condition.

@dadams39 dadams39 deleted the gc-remove-limit branch August 19, 2022 16:51
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor GC populator to address deadlock with Kopia uploader
4 participants