This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

Integration CI Testing Failing on Multiple Branches #356

Closed
1 task done
dadams39 opened this issue Jul 18, 2022 · 8 comments
Labels
bug Something isn't working

Comments

@dadams39
Contributor

dadams39 commented Jul 18, 2022

Receiving Timeout Errors during one of the operation tests:

@dadams39 dadams39 added the bug Something isn't working label Jul 18, 2022
@ryanfkeepers
Contributor

Fwiw, the creation of a test file isn't necessary to run local integration tests. But I can see that the document lacks details either way. Adding some of that now.

@dadams39
Contributor Author

I get that it is not. I would imagine users with less experience with the project would have similar experiences.

@dadams39
Contributor Author

Findings

Fundamental failure was caused by the following:

  • collectionChannelBufferSize = 1000
  • Outlook not displaying all of the files in the web browser

During the CI tests, these invisible messages were being added to the collection. After reaching the 999th item, the process would hang and the tests would eventually time out. Investigation is ongoing to understand why the BackupWriter didn't reduce the number of messages in the channel.
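
To illustrate the hang described above, here is a minimal Go sketch of how a buffered channel behaves when nothing reads from it. It is not the project's code: `acceptedSends` is a hypothetical helper, and the only value taken from the findings is the buffer size of 1000. The sketch uses a non-blocking send so that it reports where a real send would block instead of hanging.

```go
package main

import "fmt"

// acceptedSends tries to push `items` values into a channel with the given
// buffer capacity while nothing is reading from it. It returns how many
// sends succeeded before one would have blocked.
// Hypothetical helper for illustration only.
func acceptedSends(capacity, items int) int {
	ch := make(chan int, capacity)
	for i := 0; i < items; i++ {
		select {
		case ch <- i:
			// The buffer still had room for this item.
		default:
			// The buffer is full. A plain `ch <- i` would block here
			// until a consumer received something.
			return i
		}
	}
	return items
}

func main() {
	// Assumed values: a buffer of 1000 (as in the findings) and a folder
	// holding more messages than that.
	fmt.Println(acceptedSends(1000, 1500))
}
```

With a plain blocking send and no active reader, the producer would stall at the point where this helper returns, which matches the timeout seen in CI.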

@ashmrtn
Contributor

ashmrtn commented Jul 18, 2022

> Investigation is ongoing to understand why the BackupWriter didn't reduce the number of messages in the channel.

It's possible that (async) GC and BackupWriter were working on disjoint sets of folders and that caused a deadlock when the channel for ExchangeDataCollection got full. Kopia uses some (configurable) number of goroutines to upload items. Each goroutine works on a single file or directory at a time and processes it to completion before moving to another file or directory. However, if the number of folders that need to be uploaded is larger than the number of goroutines, it's possible that the GC goroutine(s) are adding items to folders that kopia is not currently uploading.

As an example, kopia may have 1 goroutine trying to upload items in the inbox directory. GC may have one goroutine adding items to the DataCollection for archive. However, because neither component will examine other directories until it's done with the current one, the system will deadlock if GC tries to fetch more messages than the channel size for the ExchangeDataCollection backing the archive directory.

Increasing the number of goroutines kopia uses may help, but cannot guarantee a deadlock will not occur. This is because there can be an arbitrary number of folders in a backup.
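
The inbox/archive scenario can be reproduced in a small Go simulation. This is a sketch under stated assumptions, not kopia's or GC's actual code: `stalls`, `drain`, and `stallTimeout` are hypothetical names, one goroutine stands in for each component, and a timeout is used to detect the stall so the program always terminates.

```go
package main

import (
	"fmt"
	"time"
)

// stallTimeout is an arbitrary illustrative limit for detecting a stall.
const stallTimeout = 200 * time.Millisecond

// drain reads from ch until it is closed. It returns false if stop fires first.
func drain(ch <-chan int, stop <-chan struct{}) bool {
	for {
		select {
		case _, ok := <-ch:
			if !ok {
				return true
			}
		case <-stop:
			return false
		}
	}
}

// stalls reports whether a producer filling "archive" and a consumer waiting
// on "inbox" get stuck when archive holds more items than the buffer allows.
func stalls(capacity, archiveItems int) bool {
	inbox := make(chan int, capacity)
	archive := make(chan int, capacity)
	stop := make(chan struct{})
	done := make(chan struct{})

	// Producer: works on archive only and closes both folders once it
	// has finished, so inbox stays open while archive is being filled.
	go func() {
		defer close(inbox)
		defer close(archive)
		for i := 0; i < archiveItems; i++ {
			select {
			case archive <- i:
			case <-stop:
				return
			}
		}
	}()

	// Consumer: finishes inbox completely before looking at archive.
	go func() {
		if drain(inbox, stop) && drain(archive, stop) {
			close(done)
		}
	}()

	select {
	case <-done:
		return false
	case <-time.After(stallTimeout):
		close(stop) // release both goroutines
		return true
	}
}

func main() {
	fmt.Println("5 items, buffer 2:", stalls(2, 5))
	fmt.Println("2 items, buffer 2:", stalls(2, 2))
}
```

In the first call the producer blocks on the full archive channel while the consumer waits for inbox to close, so neither side makes progress. In the second call everything fits in the buffer and the run completes.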

@ryanfkeepers
Contributor

Synchronous deadlock explanation:

SerializeMessages() loaded messages according to the following algorithm:

  1. Retrieve all of the user's folders and message IDs.
  2. Aggregate messages by their folder id.
  3. Iterating through each folder, download all messages in the folder, feeding each into a DataCollection channel.
  4. Return the loaded set of folders as a slice of DataCollections for downstream consumers.

DataCollection channels used a buffer limit of 1000 entries. If any folder exceeded 1000 messages, the channel buffer would fill, blocking the function from continuing to load any further messages or folders, effectively locking the entire system.
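
For contrast with the algorithm above, here is a sketch of one generic way to keep a large folder from blocking the loader: return the collection first and fill its channel from a separate goroutine. This is an assumption-laden illustration, not the project's code and not necessarily what the eventual fix did. `collection`, `loadFolder`, and `consume` are hypothetical names; only the buffer limit of 1000 comes from the description above.

```go
package main

import "fmt"

// bufferSize mirrors the 1000-entry buffer limit described above.
const bufferSize = 1000

// collection is a hypothetical stand-in for a DataCollection.
type collection struct {
	folder string
	items  chan string
}

// loadFolder returns immediately and fills the channel in the background,
// so a folder larger than bufferSize does not block the caller. The
// background goroutine still pauses whenever the buffer is full, but it
// resumes as soon as a consumer reads.
func loadFolder(folder string, messages int) collection {
	c := collection{folder: folder, items: make(chan string, bufferSize)}
	go func() {
		defer close(c.items) // signal the consumer that the folder is complete
		for i := 0; i < messages; i++ {
			c.items <- fmt.Sprintf("%s/msg-%d", folder, i)
		}
	}()
	return c
}

// consume drains a collection and reports how many items it received.
func consume(c collection) int {
	n := 0
	for range c.items {
		n++
	}
	return n
}

func main() {
	// 2500 is an arbitrary count chosen to exceed bufferSize.
	c := loadFolder("inbox", 2500)
	fmt.Println(consume(c))
}
```

Because the caller receives the collection before the channel fills, a downstream consumer can start reading while messages are still being loaded. Spawning one goroutine per folder is a simplification; a real implementation would likely bound concurrency.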

@vkamra
Contributor

vkamra commented Jul 21, 2022

Addressing #360 will fix this issue. In addition to that fix, we should also reduce the scope of the integration test (back up a specific folder only, restore as COPY in a restore- folder). The larger-scoped, long-running tests should not run in CI.

@dadams39
Contributor Author

dadams39 commented Aug 5, 2022

Explanation for the strikethrough of "Document/Adjust long-running processes...":

The major changes were #360, to deal with the deadlock, and a large refactor along the lines of #361 that was implemented in stages. Both of those code changes are present in main as of today.

The final cause of tests timing out is to be addressed in PR #479.

@dadams39
Contributor Author

dadams39 commented Aug 5, 2022

CI documentation is a constantly improving process. Steps have been taken to improve the initial setup of developer environments. While the process is not complete, the main objective of this issue was to address the cascading failures that were experienced several days ago. Those CI failures have been addressed or have a separate issue in the repository. The issue is to be closed.

@dadams39 dadams39 closed this as completed Aug 5, 2022

4 participants