Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot create a project in cloud #9789

Closed
2 tasks done
farmaazon opened this issue Apr 25, 2024 · 42 comments · Fixed by #9858
Closed
2 tasks done

Cannot create a project in cloud #9789

farmaazon opened this issue Apr 25, 2024 · 42 comments · Fixed by #9858
Assignees
Labels
--bug Type: bug --regression Important: regression -language-server p-highest Should be completed ASAP

Comments

@farmaazon
Copy link
Contributor

Discord username

farmaazon

What type of issue is this?

Permanent – Occurring repeatably

Is this issue blocking you from using Enso?

  • Yes, I can't use Enso because of this issue.

Is this a regression?

  • Yes, previous version of Enso did not have this issue.

What issue are you facing?

When I create a new project in the cloud, it loads but does not show any nodes. When I stop the project and re-run, it does not open even after long waiting (10 minutes).

Expected behaviour

I see the starting node after creating new project; running it should open it in a reasonable time.

How we can reproduce it?

No response

Screenshots or screencasts

No response

Logs

No response

Enso Version

a786ad2

Browser or standalone distribution

Standalone distribution (cloud project)

Browser Version or standalone distribution

standaolne

Operating System

Linux

Operating System Version

Garuda

Hardware you are using

No response

@PabloBuchu
Copy link
Contributor

So there is bunch of errors in messages and logs from language server. @hubertp could you take a look? @somebody1234 have anything changed in communication with the server?

Screenshot 2024-04-29 at 13 54 05 Screenshot 2024-04-29 at 13 56 23 Screenshot 2024-04-29 at 13 55 18

@hubertp
Copy link
Collaborator

hubertp commented Apr 29, 2024

Received InboundMessage with unknown payload type [1] looks like a culprit.

@hubertp hubertp added the p-highest Should be completed ASAP label Apr 29, 2024
@somebody1234
Copy link
Contributor

seems like this is the reason why:

@farmaazon
Copy link
Contributor Author

seems like this is the reason why:

I'm doubtful.

The flatbuffer's format does not change between different versions of flatc - so the format emitted by code generated with never version of flatc should be happily read by code generated by older version, if the schema files were the same - and those haven't been touched from a long time.

And the fact, that the local projects are working properly seems to confirm it. Any issue with binary connection should be visible when running local projects too. But this does not happen.

@somebody1234
Copy link
Contributor

ah, weird... in that case i'm not sure why payload type 1 is an error then:

INIT_SESSION_CMD = 1,

since it seems to be the session initialization payload...

@somebody1234
Copy link
Contributor

so it seems line engine does not handle duplicate INIT_SESSION messages.

private def createRequestHandlers(
outboundChannel: ActorRef
): Map[InboundPayloadType, Props] = {
Map(
WRITE_FILE_CMD -> WriteBinaryFileHandler
.props(requestTimeout, fileManager, outboundChannel),
READ_FILE_CMD -> ReadBinaryFileHandler
.props(requestTimeout, fileManager, outboundChannel),
CHECKSUM_BYTES_CMD -> ChecksumBytesHandler
.props(requestTimeout, fileManager, outboundChannel),
WRITE_BYTES_CMD -> WriteBytesHandler
.props(requestTimeout, fileManager, outboundChannel),
READ_BYTES_CMD -> ReadBytesHandler
.props(requestTimeout, fileManager, outboundChannel)
)
}

i assume then, that this is due to the same 15 second timeout issue from before?

@somebody1234
Copy link
Contributor

this is also suspicious:
image

maybe the dashboard is starting GUI2 twice...?

@hubertp
Copy link
Collaborator

hubertp commented Apr 29, 2024

In my case I'm getting a blank page as well but no errors on backend. Maybe one of the console errors explains?
Screenshot from 2024-04-29 15-40-18
Screenshot from 2024-04-29 15-38-31

@somebody1234
Copy link
Contributor

i assume cloud.enso.org still doesn't work as it's still using GUI1?

@hubertp
Copy link
Collaborator

hubertp commented Apr 30, 2024

i assume cloud.enso.org still doesn't work as it's still using GUI1?

Whaaa? Well that would explain things. But I talked to @PabloBuchu yesterday and he said it is running develop.

@somebody1234
Copy link
Contributor

hmm. i guess it's running develop, but the logic to fetch GUI from S3 has not been changed - and i guess S3 has not been updated with GUI2

it's worth noting that cloud.enso.org does not work anyway (afaict), because we do not yet have the Y.js server (for syncing GUI state and text file state) running on the cloud.

@hubertp
Copy link
Collaborator

hubertp commented May 2, 2024

I'm seeing (with trace level logs) something suspicious:

May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]: [TRACE] [2024-05-02T15:40:30Z] [org.enso.librarymanager.local.DefaultLocalLibraryProvider] Local library Standard.Base not found at [***/libraries].
May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]: [TRACE] [2024-05-02T15:40:30Z] [org.enso.librarymanager.local.DefaultLocalLibraryProvider] Failed to load the candidate library package description at [***/data_root].
May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]: org.enso.pkg.PackageManager$PackageNotFound: The package file does not exist.
May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]:         at org.enso.runtime/org.enso.pkg.PackageManager.loadPackage(Package.scala:348)
May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]:         at org.enso.runtime/org.enso.librarymanager.local.DefaultLocalLibraryProvider.$anonfun$findCandidates$2(DefaultLocalLibraryProvider.scala:79)
May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]:         at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:178)
May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]:         at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179)
May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]:         at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]:         at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1939)
May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]:         at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]:         at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]:         at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:575)
May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]:         at java.base/java.util.stream.AbstractPipeline.evaluateToArrayNode(AbstractPipeline.java:260)
May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]:         at java.base/java.util.stream.ReferencePipeline.toArray(ReferencePipeline.java:616)
May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]:         at java.base/java.util.stream.ReferencePipeline.toArray(ReferencePipeline.java:622)
May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]:         at java.base/java.util.stream.ReferencePipeline.toList(ReferencePipeline.java:627)
May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]:         at org.enso.runtime/org.enso.librarymanager.local.DefaultLocalLibraryProvider.findCandidates(DefaultLocalLibraryProvider.scala:77)
May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]:         at org.enso.runtime/org.enso.librarymanager.local.DefaultLocalLibraryProvider.findLibraryHelper(DefaultLocalLibraryProvider.scala:39)
May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]:         at org.enso.runtime/org.enso.librarymanager.local.DefaultLocalLibraryProvider.findLibrary(DefaultLocalLibraryProvider.scala:24)
May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]:         at org.enso.runtime/org.enso.librarymanager.LibraryResolver.resolveLibraryVersion(LibraryResolver.scala:36)
May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]:         at org.enso.runtime/org.enso.librarymanager.DefaultLibraryProvider.findLibrary(DefaultLibraryProvider.scala:52)
May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]:         at org.enso.runtime/org.enso.interpreter.runtime.DefaultPackageRepository.ensurePackageIsLoaded(DefaultPackageRepository.scala:411)
May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]:         at org.enso.runtime/org.enso.compiler.phase.ImportResolverForIR.loadLibraryModule(ImportResolverForIR.java:124)
May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]:         at org.enso.runtime/org.enso.compiler.phase.ImportResolverForIR.loadLibraryModule(ImportResolverForIR.java:23)
May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]:         at org.enso.runtime/org.enso.compiler.phase.ImportResolverAlgorithm.tryResolveImportNew(ImportResolverAlgorithm.java:141)
May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]:         at org.enso.runtime/org.enso.compiler.phase.ImportResolverAlgorithm.tryResolveImport(ImportResolverAlgorithm.java:72)
May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]:         at org.enso.runtime/org.enso.compiler.phase.ImportResolver.$anonfun$mapImports$3(ImportResolver.scala:74)
May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]:         at org.enso.runtime/scala.collection.immutable.List.map(List.scala:246)
May 02 15:40:30 ip-172-31-0-127.eu-west-1.compute.internal enso_runtime[2573]:         at org.enso.runtime/org.enso.compiler.phase.ImportResolver.analyzeModule$1(ImportResolver.scala:72)

As if all libraries were gone

@hubertp
Copy link
Collaborator

hubertp commented May 3, 2024

Actually that doesn't seem to be the cause of the problem. I will need to investigate why it happens in the first place but it seems to recover from this problem.

@hubertp
Copy link
Collaborator

hubertp commented May 3, 2024

Not sure how relevant that is but why is GUI reporting a wrong version in cloud? I thought it's using latest nightly?
Screenshot from 2024-05-03 13-01-54

@jdunkerley
Copy link
Member

jdunkerley commented May 3, 2024

hmm. i guess it's running develop, but the logic to fetch GUI from S3 has not been changed - and i guess S3 has not been updated with GUI2

it's worth noting that cloud.enso.org does not work anyway (afaict), because we do not yet have the Y.js server (for syncing GUI state and text file state) running on the cloud.

We have had the GUI2 running with a cloud backend in electron. Yes we have to run the y.js server locally but it was working.

@somebody1234
Copy link
Contributor

yup - by "cloud.enso.org" i mean specifically the website hosted at cloud.enso.org.

when we run through electron (and on the browser of the electron server!) i believe we do use the local version of the GUI, as the check for whether we need to download the GUI, is determined by whether the Local Backend (the Project Manager backend) is supported

@hubertp
Copy link
Collaborator

hubertp commented May 3, 2024

So I have a fix for the timeout that is being reported when using electron app and connection to the cloud. The issue is basically... weak hardware. On my machine first compilation takes max 3-4 seconds on a bad day. In the cloud setup it takes at least 25 seconds alone. Given that we have timeouts for serving API requests set to 5 seconds, the problem becomes rather common.

When running via browser, I believe it is not running the latest GUI? At least the type of requests seem a bit outdated and different from what electron (nightly) is doing.

mergify bot pushed a commit that referenced this issue May 6, 2024
When `PROFILING_FILENAME` and `PROFILING_TIME` are set, language server will collect profiling data on startup and place it under `/opt/enso/profiling/$PROFILING_NAME` where it can be fetched from.

Needed to better analyze #9789.
@mergify mergify bot closed this as completed in #9858 May 6, 2024
mergify bot pushed a commit that referenced this issue May 6, 2024
Follow up to #9558, this time to `SetExecutionContextEnvironmentHandler` that was timing out in #9789.
Added a base classes that handles the repeatable logic.

Maybe it will close #9789.
@github-project-automation github-project-automation bot moved this from ❓New to 🟢 Accepted in Issues Board May 6, 2024
@enso-bot
Copy link

enso-bot bot commented May 6, 2024

Hubert Plociniczak reports a new STANDUP for the provided date (2024-05-03):

Progress: Paused work on #9736 to investigate cloud issues. Looks like slow compilations are delaying handling of requests leading to timeouts. Added workaround in multiple handlers so that we don't return timeout immediately. It should be finished by 2024-05-06.

Next Day: Next day I will be working on the #9789 task. Continue the investigation. Figure out if we can profile the startup to find the root cause.

mergify bot pushed a commit that referenced this issue May 9, 2024
I'm seeing occasional IO timeouts, especially on startup operations, for cloud projects. Adding some logging to make an informed decision if there are some problems there.

Related to #9789

# Important Notes
Also added retries when closing the file as I saw a number of times:
```
Session release failed.
LsRpcError: Language server request 'text/closeFile' failed.
at LanguageServer.request (/tmp/.mount_enso-leMqqdS/resources/app.asar/index.cjs:58291:15)
at async Promise.all (index 0)
at async _LanguageServerSession.release (/tmp/.mount_enso-leMqqdS/resources/app.asar/index.cjs:59165:5)
at async /tmp/.mount_enso-leMqqdS/resources/app.asar/index.cjs:59670:7 {
cause: JSONRPCError2: Request timeout request took longer than 15000 ms to resolve
at new JSONRPCError2 (/tmp/.mount_enso-leMqqdS/resources/app.asar/index.cjs:26822:30)
at Timeout._onTimeout (/tmp/.mount_enso-leMqqdS/resources/app.asar/index.cjs:26985:20)
at listOnTimeout (node:internal/timers:569:17)
at process.processTimers (node:internal/timers:512:7) {
code: 7777,
data: undefined
},
request: 'text/closeFile',
params: {
path: {
rootId: '00000000-0000-0000-0000-000000000001',
segments: [Array]
}
}
}
```
@enso-bot
Copy link

enso-bot bot commented May 10, 2024

Hubert Plociniczak reports a new STANDUP for the provided date (2024-05-08):

Progress: Continued an effort to pinpoint the change that broke cloud support. Looks like it is mostly cloud-setup dependent as sometimes gui build works and sometimes not. Waiting on profiling data and more logs info to confirm. In the meantime continued my work on #9736, trying to get a draft PR out. It should be finished by 2024-05-09.

Next Day: Next day I will be working on the #9789 task. Continue the investigation.

@enso-bot
Copy link

enso-bot bot commented May 12, 2024

Hubert Plociniczak reports a new STANDUP for the provided date (2024-05-09):

Progress: Draft PR for #9736, waiting on the option to gather cloud profile data. It should be finished by 2024-05-09.

Next Day: Next day I will be working on the #9789 task. Continue the investigation.

@enso-bot
Copy link

enso-bot bot commented May 13, 2024

Hubert Plociniczak reports a new 🔴 DELAY for the provided date (2024-05-10):

Summary: There is 6 days delay in implementation of the Cannot create a project in cloud (#9789) task.
It will cause 0 days delay for the delivery of this weekly plan.

Delay Cause: Still haven't figured out why cloud setup affects performance. Unable to reproduce problems locally.

@enso-bot
Copy link

enso-bot bot commented May 13, 2024

Hubert Plociniczak reports a new STANDUP for the provided date (2024-05-10):

Progress: Analyzing profiling data, adding more logs, reducing potential bottlenecks to unclog cloud setup (#9927, #9915) It should be finished by 2024-05-15.

Next Day: Next day I will be working on the #9789 task. Continue the investigation.

@enso-bot
Copy link

enso-bot bot commented May 15, 2024

Hubert Plociniczak reports a new STANDUP for the provided date (2024-05-13):

Progress: Addressing review on PRs, still trying to investigate bottlenecks based on the logs and profiling data. It should be finished by 2024-05-15.

Next Day: Next day I will be working on the #9789 task. Continue the investigation.

mergify bot pushed a commit that referenced this issue May 16, 2024
Setting execution environment to the existing one should have no effect.
Should (positively) affect startup in #9789.

# Important Notes
Cancelling jobs and triggering a fresh execute job is expensive and unnecessary, especially on startup, when the result should be the same as before.
vitvakatu pushed a commit that referenced this issue May 17, 2024
Setting execution environment to the existing one should have no effect.
Should (positively) affect startup in #9789.

# Important Notes
Cancelling jobs and triggering a fresh execute job is expensive and unnecessary, especially on startup, when the result should be the same as before.
@hubertp
Copy link
Collaborator

hubertp commented May 17, 2024

Regarding websocket being randomly closed on startup I was able to reproduce the problem by adding a simple Thread.sleep in https://github.com/enso-org/enso/blob/develop/engine/language-server/src/main/scala/org/enso/languageserver/http/server/BinaryWebSocketServer.scala#L71. Except in my case in the logs I was able to see that the connection was being closed.
Screenshot from 2024-05-17 15-59-02

The conclusion is that I believe the server is simply overloaded on startup and slow to respond to a new websocket connection request, leading to some kind of timeout on the GUI side. On a subsequent try from GUI it typically connects without any problems so the problem can be more or less ignored.

The performance in the cloud, given the current resources, is still unsatisfactory and can lead to such random failures. We don't want to "recover" from such issues, we simply don't want to have them to appear in the first place.

GitHub
Hybrid visual and textual functional programming. Contribute to enso-org/enso development by creating an account on GitHub.

@hubertp
Copy link
Collaborator

hubertp commented May 19, 2024

I finally managed to get hold of profiling data from a low spec machine that more closely resembles users' and current cloud spec.
Uploading profiling.npss.txt…
It looks like a considerable amount of time is spent in jackson serde.

hubertp added a commit that referenced this issue May 19, 2024
Add to `-debug.profile` on startup to turn on the profiler on a local
machine and dump data to `profiling.npss` file. Previously not possible
in a standalone setup.

Useful for #9789.
hubertp added a commit that referenced this issue May 19, 2024
Add `-debug.profile` on startup to turn on the profiler on a local
machine and dump data to `profiling.npss` file. Previously not possible
in a standalone setup.

Useful for #9789.
@enso-bot
Copy link

enso-bot bot commented May 20, 2024

Hubert Plociniczak reports a new 🔴 DELAY for today (2024-05-20):

Summary: There is 7 days delay in implementation of the Cannot create a project in cloud (#9789) task.
It will cause 0 days delay for the delivery of this weekly plan.

Delay Cause: Regressions when testing on low spec machine. Temporarily moved on to work on other issues.

@enso-bot
Copy link

enso-bot bot commented May 20, 2024

Hubert Plociniczak reports a new STANDUP for today (2024-05-20):

Progress: Finding culprit change for #9993. Continued work on replacing jackson to potentially speedup startup, based on the collected profiling data. Devising plan on further improvements to startup. It should be finished by 2024-05-22.

Next Day: Next day I will be working on the #9789 task. Continue the investigation.

@enso-bot
Copy link

enso-bot bot commented May 20, 2024

Hubert Plociniczak reports a new STANDUP for the provided date (2024-05-16):

Progress: Not much progress with gathering profiling data from cloud. Switched to a local low-spec machine. It should be finished by 2024-05-22.

Next Day: Next day I will be working on the #9789 task. Continue the investigation.

@enso-bot
Copy link

enso-bot bot commented May 20, 2024

Hubert Plociniczak reports a new STANDUP for the provided date (2024-05-17):

Progress: Added profiling option to AppImage so that profiling data can be collected from a local setup. Found a recent regression on startup and narrowed it down to a range of commits. Analyzing collected data and trying to come up with some perf improvements. It should be finished by 2024-05-22.

Next Day: Next day I will be working on the #9789 task. Continue the investigation.

@enso-bot
Copy link

enso-bot bot commented May 22, 2024

Hubert Plociniczak reports a new STANDUP for yesterday (2024-05-21):

Progress: Still tinkering with jackson replacement. Dealing with non-deterministic bug in deserialization in jsoniter. Will file a bug upstream. It should be finished by 2024-05-22.

Next Day: Next day I will be working on the #9789 task. Continue the investigation.

@enso-bot
Copy link

enso-bot bot commented Jun 9, 2024

Hubert Plociniczak reports a new 🔴 DELAY for the provided date (2024-06-06):

Summary: There is 19 days delay in implementation of the Cannot create a project in cloud (#9789) task.
It will cause 0 days delay for the delivery of this weekly plan.

Delay Cause: Draft PR was parked until refactoring of ModuleScope was finished.

@enso-bot
Copy link

enso-bot bot commented Jun 9, 2024

Hubert Plociniczak reports a new STANDUP for the provided date (2024-06-06):

Progress: Bringing old PR up-to-date. Trying workarounds for some encoding issues with Option values. Initial performance results are very promising. It should be finished by 2024-06-10.

Next Day: Next day I will be working on the #9789 task. Continue integrating new JSON serde

@enso-bot
Copy link

enso-bot bot commented Jun 10, 2024

Hubert Plociniczak reports a new STANDUP for the provided date (2024-06-07):

Progress: Testing the final version of PR. Confirmed performance results. Removing jackson whenever possible. Added a workaround for Option values that caused issues when deserializing them. It should be finished by 2024-06-10.

Next Day: Next day I will be working on the #9789 task. Deal with licensing, finish removing jackson

@enso-bot
Copy link

enso-bot bot commented Jun 12, 2024

Hubert Plociniczak reports a new STANDUP for the provided date (2024-06-10):

Progress: Finished replacement of jackson to jsoniter. Started looking into performance issues in the cloud #10231. Profiling startup. It should be finished by 2024-06-10.

Next Day: Next day I will be working on the #10231 task. Continue investigating startup issues

hubertp added a commit that referenced this issue Jun 13, 2024
* Add an option to profile AppImage

Add `-debug.profile` on startup to turn on the profiler on a local
machine and dump data to `profiling.npss` file. Previously not possible
in a standalone setup.

Useful for #9789.

* fix linter

* fix linter warning

* more linting

* lint
@hubertp
Copy link
Collaborator

hubertp commented Jul 8, 2024

This is currently blocked on Ydoc integration in the cloud.

@hubertp
Copy link
Collaborator

hubertp commented Jul 9, 2024

Partially related: #9993 (comment) Blocked by #10495

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
--bug Type: bug --regression Important: regression -language-server p-highest Should be completed ASAP
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

6 participants