
Fix multiple instances of Tribler on Flatpak #7621

Closed
wants to merge 2 commits

Conversation

Contributor

@xoriole xoriole commented Oct 6, 2023

This PR fixes multiple instances of Tribler running on Flatpak. If the app can connect to a previous instance via QLocalSocket, another instance is already running. In that case, Tribler now closes after passing its arguments to the running process via the Qt local socket.
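A minimal sketch of this hand-off, assuming PyQt5 and a hypothetical socket name (the actual identifiers in Tribler's code may differ):

import sys
from PyQt5.QtNetwork import QLocalSocket

SOCKET_NAME = "triblerapp"  # assumed single-instance socket name

def forward_args_and_exit(args):
    socket = QLocalSocket()
    socket.connectToServer(SOCKET_NAME)
    if socket.waitForConnected(1000):
        # A previous instance owns the socket: hand over our arguments
        # and terminate this process instead of starting a second one.
        socket.write(' '.join(args).encode('utf-8'))
        socket.waitForBytesWritten(1000)
        socket.disconnectFromServer()
        sys.exit(0)
    # No previous instance reachable; continue normal startup.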

In addition, for the core process, the PR now checks whether the REST API is accessible whenever the API port is known. If the API is not accessible, the core process is considered not running. Checking the PID alone is not sufficient in the Flatpak environment because PIDs are no longer unique inside the sandbox.
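A sketch of this extended liveness check, with illustrative names (API_CHECK_TIMEOUT_IN_SECONDS and the helper are assumptions, not Tribler's actual code):

import psutil
import requests

API_CHECK_TIMEOUT_IN_SECONDS = 2

def core_process_is_running(pid, api_port):
    if not psutil.pid_exists(pid):
        return False
    if api_port is None:
        # Port not yet known; fall back to the PID check alone.
        return True
    try:
        requests.get(f"http://localhost:{api_port}/docs",
                     timeout=API_CHECK_TIMEOUT_IN_SECONDS)
        return True
    except requests.RequestException:
        # PIDs are not unique inside the Flatpak sandbox, so an
        # unreachable API is treated as "not running".
        return False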

This PR addresses the issue of multiple instances of the Flatpak application running in parallel and crashing Tribler when one of the instances is closed. It is addressed in the following way.

If the last primary process from the database has a PID that either matches the current process PID or can be confirmed to be running, then the current process instance is terminated. This prevents the second instance of Tribler from running.
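Expressed as a sketch (the processes.sqlite lookup is assumed to happen in the caller; psutil is used here only for illustration):

import os
import psutil

def should_terminate(last_primary_pid):
    # last_primary_pid: the PID recorded for the last primary process
    # in processes.sqlite.
    return last_primary_pid == os.getpid() or psutil.pid_exists(last_primary_pid)

If should_terminate(...) returns True, the current instance exits before starting a second GUI/Core pair.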

It could be argued that a PID could be reused, leading Tribler to terminate itself with no instance running at all. However, the likelihood of the last Tribler process having the same PID as the current process while the database still records that process as primary is, in my opinion, negligible. Even if it happens, Tribler will most likely start on the next run, since the PID would almost certainly differ from the previous one, which I consider acceptable.

Fixes #7626
Partially fixes #7603

@xoriole xoriole marked this pull request as ready for review October 6, 2023 14:56
@xoriole xoriole requested review from a team and egbertbouman and removed request for a team October 6, 2023 14:56
@kozlovsky kozlovsky self-requested a review October 9, 2023 06:56
@kozlovsky
Contributor

Thank you for your efforts in addressing this issue. After looking it over, I have some concerns about the correctness of the suggested approach. I'm diving deeper into the details and will provide a more comprehensive review in about half an hour.

Contributor

@kozlovsky kozlovsky left a comment

In my opinion, this is an interesting attempt to solve the issue, but unfortunately I think it can introduce significant new issues and makes the code logic more complicated. I think we need to find a different approach.

@@ -64,7 +64,7 @@ def run_gui(api_port: Optional[int], api_key: Optional[str], root_state_dir, par
     translator = get_translator(settings.value('translation', None))
     app.installTranslator(translator)

-    if not current_process_is_primary:
+    if not current_process_is_primary or app.connected_to_previous_instance:
Contributor

This PR contains two fixes, for the GUI and Core processes. Let's first look at the fix for the GUI process, which is implemented as an extended check at line 67 that adds the app.connected_to_previous_instance condition.

It looks like this PR can indeed solve the problem on Flatpak for the case when the first (primary) Flatpak-based Tribler instance is already running and a second (secondary) Flatpak-based instance is starting. Before your fix, the second GUI process incorrectly considered itself a primary instance because it has the same PID as the first GUI process. With your changes, the second GUI process is still regarded as a primary process but finishes immediately because it can connect to the first GUI process via the local socket.

However, the solution proposed in this PR creates a race condition for the case when two instances of the Tribler application start at the same moment. This situation is very common for some obscure reason and needs to be handled properly. With a simultaneous start of two GUI processes, the first process may claim itself primary in the processes.sqlite lock file while the second process is the first to grab the local socket. In that case, the first (primary) process can end up connected to the local socket of the secondary process, and both processes stop simultaneously: the second because it is not primary, and the first because it was able to connect to the local socket of the secondary process.

In other words, this change fixes the Flatpak issue but re-creates a race condition for the non-Flatpak case. Previously, the race condition manifested itself as two GUI processes running in parallel; now it appears as a Tribler application that refuses to start at all.

Contributor Author

Thank you for your feedback. Here's a more detailed explanation:

  1. The app.connected_to_previous_instance flag was introduced to enhance the current check that determines whether the process is primary. Although the existing primary-process check should ideally address the scenario where multiple instances of Tribler exist, it doesn't perform as expected in environments like Flatpak. The added condition is supplemental (an OR condition), ensuring it doesn't disrupt the standard execution flow.

  2. The value of app.connected_to_previous_instance is set to True only when the new GUI process successfully connects to the local socket. If a race condition were to occur, this value would be set to False, retaining the original execution behavior, in my opinion.

I haven't encountered or been able to reproduce a race condition scenario in the current setup. However, I'm keen to understand your perspective and would welcome any suggestions on potential improvements or alternative solutions.

Contributor

@kozlovsky kozlovsky Oct 9, 2023

It is possible to encounter the following case:

  1. Two GUI processes started at the same time (that was a very frequent issue until we implemented a new SQLite-based lock; the previous triblerd.lock implementation had race conditions in this scenario).
  2. The first GUI process was first to grab the file lock and mark itself as a primary process.
  3. The second GUI process was first to grab the local socket.

I believe it is a pretty frequent scenario. In my opinion, the current logic in the PR does not handle this situation properly. To handle it properly, the GUI process should grab the local socket while holding the file lock. It can complicate the logic a bit and increase the time the process holds the file lock.
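A sketch of that ordering, with the lock and the primary-marking step stubbed out as placeholders (not Tribler's actual code):

from contextlib import contextmanager
from PyQt5.QtNetwork import QLocalServer

@contextmanager
def process_db_lock():
    # Placeholder for the SQLite-based lock on processes.sqlite.
    yield

def claim_primary(socket_name):
    server = QLocalServer()
    with process_db_lock():
        # Mark ourselves primary and grab the socket under the same lock,
        # so no other process can interleave between the two steps.
        QLocalServer.removeServer(socket_name)  # clear a stale socket, if any
        if not server.listen(socket_name):
            raise RuntimeError("another process already owns the socket")
    return server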


try:
    docs_url = f"http://localhost:{self.api_port}/docs"
    _ = requests.get(docs_url, timeout=API_CHECK_TIMEOUT_IN_SECONDS)
Contributor

@kozlovsky kozlovsky Oct 9, 2023

The second fix proposed in this PR is an HTTP request to the Tribler REST API added to the TriblerProcess.is_running method.

Note that the is_running method is called, for example, when TriblerProcess.__str__() is called. It looks a bit extreme to me to issue an HTTP request whenever someone tries to print an instance of a core process.

Second, it is not clear why this check is helpful for the Core process, for two reasons:

  1. If the logic for GUI processes is correct, the second GUI instance should understand that it is secondary and should not start a second Core process at all. So, the check for the Core is less important than the one for the GUI, provided the GUI check is implemented correctly.

  2. It is hard to say what this check allows us to achieve. Note that with the current implementation, the GUI process does not pass the API port number to the Core process; the Core process opens the REST API on any available port and writes the value to the database so the GUI process can look it up. So when the Core process is just starting, its api_port is always None, and this check will not be triggered. And when two Core processes are running by mistake, their api_port values will differ: each Core process has its own API port and can connect to it. (By the way, can the code create a deadlock when the Core process connects to its own port? The call to requests.get is synchronous, so it looks like it should block the asyncio loop; see the sketch after this list.) It is hard to say why this check of HTTP endpoint availability is useful and how correctly it works in all the different use cases.
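To illustrate the blocking concern, one way to keep the synchronous requests.get call off the asyncio event loop, e.g. via run_in_executor (names here are assumptions, not Tribler's actual code):

import asyncio
import requests

async def api_is_reachable(api_port, timeout=2.0):
    loop = asyncio.get_running_loop()
    url = f"http://localhost:{api_port}/docs"
    try:
        # Run the blocking HTTP call in a worker thread so the
        # event loop stays responsive.
        await loop.run_in_executor(
            None, lambda: requests.get(url, timeout=timeout))
        return True
    except requests.RequestException:
        return False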

@xoriole xoriole marked this pull request as draft October 10, 2023 07:39