Cannot create a project in cloud #9789
Comments
So there are a bunch of errors in the messages and logs from the language server. @hubertp could you take a look? @somebody1234 has anything changed in communication with the server?
seems like this is the reason why:
I'm doubtful. The flatbuffer's format does not change between different versions of flatc, so the format emitted by code generated with a newer version of flatc should be happily read by code generated by an older version, if the schema files were the same, and those haven't been touched for a long time. And the fact that the local projects are working properly seems to confirm it. Any issue with the binary connection should be visible when running local projects too. But this does not happen.
ah, weird... in that case i'm not sure why payload type 1 is an error then: enso/app/gui2/shared/binaryProtocol.ts Line 708 in 660c5e7
since it seems to be the session initialization payload...
so it seems like the engine does not handle duplicate INIT_SESSION messages. Lines 192 to 207 in 660c5e7
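To make the duplicate INIT_SESSION point concrete, here is a minimal sketch contrasting a strict handler (a repeated initialization is an error) with an idempotent one (a repeated initialization is acknowledged again). All names here (`SessionRegistry`, `initStrict`, `initIdempotent`, `InitResult`) are hypothetical and are not taken from the Enso code base; this only illustrates the two behaviours being discussed.

```typescript
// Hypothetical sketch: none of these names come from the Enso engine or GUI.
type InitResult = { kind: 'success' } | { kind: 'error'; message: string }

class SessionRegistry {
  private readonly sessions = new Set<string>()

  /** Strict variant: a second INIT_SESSION from the same client is rejected. */
  initStrict(clientId: string): InitResult {
    if (this.sessions.has(clientId)) {
      return { kind: 'error', message: `session already initialized: ${clientId}` }
    }
    this.sessions.add(clientId)
    return { kind: 'success' }
  }

  /** Idempotent variant: a second INIT_SESSION is simply acknowledged again. */
  initIdempotent(clientId: string): InitResult {
    this.sessions.add(clientId) // adding an existing id to a Set is a no-op
    return { kind: 'success' }
  }
}
```

With the strict variant, a client that resends the initialization message after a timeout gets an error back even though its first attempt succeeded on the server side, which would match an error reply to payload type 1.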
i assume then, that this is due to the same 15 second timeout issue from before?
i assume cloud.enso.org still doesn't work as it's still using GUI1?
Whaaa? Well that would explain things. But I talked to @PabloBuchu yesterday and he said it is running
hmm. i guess it's running develop, but the logic to fetch GUI from S3 has not been changed - and i guess S3 has not been updated with GUI2
it's worth noting that cloud.enso.org does not work anyway (afaict), because we do not yet have the Y.js server (for syncing GUI state and text file state) running on the cloud.
I'm seeing (with trace level logs) something suspicious:
As if all libraries were gone
Actually that doesn't seem to be the cause of the problem. I will need to investigate why it happens in the first place but it seems to recover from this problem.
We have had the GUI2 running with a cloud backend in electron. Yes, we have to run the y.js server locally, but it was working.
yup - by "cloud.enso.org" i mean specifically the website hosted at cloud.enso.org. when we run through electron (and on the browser of the electron server!) i believe we do use the local version of the GUI, as whether we need to download the GUI is determined by whether the Local Backend (the Project Manager backend) is supported
So I have a fix for the timeout that is being reported when using the electron app with a connection to the cloud. The issue is basically... weak hardware. On my machine the first compilation takes max 3-4 seconds on a bad day. In the cloud setup it takes at least 25 seconds alone. Given that we have timeouts for serving API requests set to 5 seconds, the problem becomes rather common. When running via browser, I believe it is not running the latest GUI? At least the types of requests seem a bit outdated and different from what electron (nightly) is doing.
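The arithmetic behind that can be sketched as follows. This is a simplified model, assuming a request cannot be served until the first compilation has finished; the function name `timesOut` and the 100 ms handling time are illustrative assumptions, while the 4 s, 25 s and 5 s figures are the ones quoted above.

```typescript
// Simplified model (an assumption, not the language server's actual scheduling):
// a request is only served once the initial compilation has completed.
function timesOut(
  arrivalMs: number, // when the request arrives
  compilationEndsMs: number, // when the first compilation finishes
  handlingMs: number, // time needed to serve the request itself (illustrative)
  timeoutMs: number, // API request timeout
): boolean {
  const startMs = Math.max(arrivalMs, compilationEndsMs)
  const latencyMs = startMs + handlingMs - arrivalMs
  return latencyMs > timeoutMs
}

// Local machine: ~4 s compilation, 5 s timeout, request sent at startup.
const localTimesOut = timesOut(0, 4_000, 100, 5_000) // false
// Cloud setup: ~25 s compilation, same timeout.
const cloudTimesOut = timesOut(0, 25_000, 100, 5_000) // true
```

Under this model any request sent during the first 20 seconds of a cloud startup exceeds the 5 second budget, which is why the failure is common there and hard to reproduce locally.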
When `PROFILING_FILENAME` and `PROFILING_TIME` are set, the language server will collect profiling data on startup and place it under `/opt/enso/profiling/$PROFILING_NAME`, from where it can be fetched. Needed to better analyze #9789.
Hubert Plociniczak reports a new STANDUP for the provided date (2024-05-03): Progress: Paused work on #9736 to investigate cloud issues. Looks like slow compilations are delaying handling of requests leading to timeouts. Added workaround in multiple handlers so that we don't return timeout immediately. It should be finished by 2024-05-06. Next Day: Next day I will be working on the #9789 task. Continue the investigation. Figure out if we can profile the startup to find the root cause.
I'm seeing occasional IO timeouts, especially on startup operations, for cloud projects. Adding some logging to make an informed decision if there are some problems there. Related to #9789.

# Important Notes

Also added retries when closing the file as I saw a number of times:

```
Session release failed. LsRpcError: Language server request 'text/closeFile' failed.
    at LanguageServer.request (/tmp/.mount_enso-leMqqdS/resources/app.asar/index.cjs:58291:15)
    at async Promise.all (index 0)
    at async _LanguageServerSession.release (/tmp/.mount_enso-leMqqdS/resources/app.asar/index.cjs:59165:5)
    at async /tmp/.mount_enso-leMqqdS/resources/app.asar/index.cjs:59670:7 {
  cause: JSONRPCError2: Request timeout request took longer than 15000 ms to resolve
      at new JSONRPCError2 (/tmp/.mount_enso-leMqqdS/resources/app.asar/index.cjs:26822:30)
      at Timeout._onTimeout (/tmp/.mount_enso-leMqqdS/resources/app.asar/index.cjs:26985:20)
      at listOnTimeout (node:internal/timers:569:17)
      at process.processTimers (node:internal/timers:512:7) {
    code: 7777,
    data: undefined
  },
  request: 'text/closeFile',
  params: {
    path: {
      rootId: '00000000-0000-0000-0000-000000000001',
      segments: [Array]
    }
  }
}
```
Hubert Plociniczak reports a new STANDUP for the provided date (2024-05-08): Progress: Continued an effort to pinpoint the change that broke cloud support. Looks like it is mostly cloud-setup dependent as sometimes gui build works and sometimes not. Waiting on profiling data and more logs info to confirm. In the meantime continued my work on #9736, trying to get a draft PR out. It should be finished by 2024-05-09. Next Day: Next day I will be working on the #9789 task. Continue the investigation.
Hubert Plociniczak reports a new STANDUP for the provided date (2024-05-09): Progress: Draft PR for #9736, waiting on the option to gather cloud profile data. It should be finished by 2024-05-09. Next Day: Next day I will be working on the #9789 task. Continue the investigation.
Hubert Plociniczak reports a new 🔴 DELAY for the provided date (2024-05-10): Summary: There is 6 days delay in implementation of the Cannot create a project in cloud (#9789) task. Delay Cause: Still haven't figured out why cloud setup affects performance. Unable to reproduce problems locally.
Hubert Plociniczak reports a new STANDUP for the provided date (2024-05-10): Progress: Analyzing profiling data, adding more logs, reducing potential bottlenecks to unclog cloud setup (#9927, #9915). It should be finished by 2024-05-15. Next Day: Next day I will be working on the #9789 task. Continue the investigation.
Hubert Plociniczak reports a new STANDUP for the provided date (2024-05-13): Progress: Addressing review on PRs, still trying to investigate bottlenecks based on the logs and profiling data. It should be finished by 2024-05-15. Next Day: Next day I will be working on the #9789 task. Continue the investigation.
Setting execution environment to the existing one should have no effect. Should (positively) affect startup in #9789.

# Important Notes

Cancelling jobs and triggering a fresh execute job is expensive and unnecessary, especially on startup, when the result should be the same as before.
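The idea can be sketched as an early return when the requested environment equals the current one. This is only an illustration under assumptions: the class `ExecutionContext`, its `restarts` counter and the `'Design' | 'Live'` values are hypothetical stand-ins, not the engine's actual implementation.

```typescript
// Hypothetical sketch of a no-op guard; not the engine's real code.
type ExecutionEnvironment = 'Design' | 'Live' // assumed set of environments

class ExecutionContext {
  private environment: ExecutionEnvironment
  /** Counts how often running jobs were cancelled and execution re-triggered. */
  restarts = 0

  constructor(initial: ExecutionEnvironment) {
    this.environment = initial
  }

  /** Returns true only if the environment changed and execution was restarted. */
  setExecutionEnvironment(next: ExecutionEnvironment): boolean {
    if (next === this.environment) {
      return false // same environment: keep running jobs, do nothing
    }
    this.environment = next
    this.restarts += 1 // stands in for "cancel jobs + trigger a fresh execute job"
    return true
  }
}
```

A GUI that sets the environment to its default on every startup would then cost nothing when the server is already in that state.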
Regarding the websocket being randomly closed on startup: I was able to reproduce the problem by adding a simple
The conclusion is that I believe the server is simply overloaded on startup and slow to respond to a new websocket connection request, leading to some kind of timeout on the GUI side. On a subsequent try from the GUI it typically connects without any problems, so the problem can be more or less ignored. The performance in the cloud, given the current resources, is still unsatisfactory and can lead to such random failures. We don't want to "recover" from such issues; we simply don't want them to appear in the first place.
I finally managed to get hold of profiling data from a low spec machine that more closely resembles users' and current cloud spec.
Add `-debug.profile` on startup to turn on the profiler on a local machine and dump data to `profiling.npss` file. Previously not possible in a standalone setup. Useful for #9789.
Hubert Plociniczak reports a new 🔴 DELAY for today (2024-05-20): Summary: There is 7 days delay in implementation of the Cannot create a project in cloud (#9789) task. Delay Cause: Regressions when testing on low spec machine. Temporarily moved on to work on other issues.
Hubert Plociniczak reports a new STANDUP for today (2024-05-20): Progress: Finding culprit change for #9993. Continued work on replacing jackson to potentially speedup startup, based on the collected profiling data. Devising plan on further improvements to startup. It should be finished by 2024-05-22. Next Day: Next day I will be working on the #9789 task. Continue the investigation.
Hubert Plociniczak reports a new STANDUP for the provided date (2024-05-16): Progress: Not much progress with gathering profiling data from cloud. Switched to a local low-spec machine. It should be finished by 2024-05-22. Next Day: Next day I will be working on the #9789 task. Continue the investigation.
Hubert Plociniczak reports a new STANDUP for the provided date (2024-05-17): Progress: Added profiling option to AppImage so that profiling data can be collected from a local setup. Found a recent regression on startup and narrowed it down to a range of commits. Analyzing collected data and trying to come up with some perf improvements. It should be finished by 2024-05-22. Next Day: Next day I will be working on the #9789 task. Continue the investigation.
Hubert Plociniczak reports a new STANDUP for yesterday (2024-05-21): Progress: Still tinkering with jackson replacement. Dealing with non-deterministic bug in deserialization in jsoniter. Will file a bug upstream. It should be finished by 2024-05-22. Next Day: Next day I will be working on the #9789 task. Continue the investigation.
Hubert Plociniczak reports a new 🔴 DELAY for the provided date (2024-06-06): Summary: There is 19 days delay in implementation of the Cannot create a project in cloud (#9789) task. Delay Cause: Draft PR was parked until refactoring of ModuleScope was finished.
Hubert Plociniczak reports a new STANDUP for the provided date (2024-06-06): Progress: Bringing old PR up-to-date. Trying workarounds for some encoding issues with Option values. Initial performance results are very promising. It should be finished by 2024-06-10. Next Day: Next day I will be working on the #9789 task. Continue integrating new JSON serde
Hubert Plociniczak reports a new STANDUP for the provided date (2024-06-07): Progress: Testing the final version of PR. Confirmed performance results. Removing jackson whenever possible. Added a workaround for Option values that caused issues when deserializing them. It should be finished by 2024-06-10. Next Day: Next day I will be working on the #9789 task. Deal with licensing, finish removing jackson
Hubert Plociniczak reports a new STANDUP for the provided date (2024-06-10): Progress: Finished replacement of jackson to jsoniter. Started looking into performance issues in the cloud #10231. Profiling startup. It should be finished by 2024-06-10. Next Day: Next day I will be working on the #10231 task. Continue investigating startup issues
* Add an option to profile AppImage. Add `-debug.profile` on startup to turn on the profiler on a local machine and dump data to `profiling.npss` file. Previously not possible in a standalone setup. Useful for #9789.
* fix linter
* fix linter warning
* more linting
* lint
This is currently blocked on Ydoc integration in the cloud.
Discord username
farmaazon
What type of issue is this?
Permanent – Occurring repeatably
Is this issue blocking you from using Enso?
Is this a regression?
What issue are you facing?
When I create a new project in the cloud, it loads but does not show any nodes. When I stop the project and re-run it, it does not open even after a long wait (10 minutes).
Expected behaviour
I see the starting node after creating a new project; running it should open it in a reasonable time.
How we can reproduce it?
No response
Screenshots or screencasts
No response
Logs
No response
Enso Version
a786ad2
Browser or standalone distribution
Standalone distribution (cloud project)
Browser Version or standalone distribution
standalone
Operating System
Linux
Operating System Version
Garuda
Hardware you are using
No response