Skip to content
This repository has been archived by the owner on Mar 11, 2022. It is now read-only.

Entity throws error after 30 seconds of inactivity #27

Closed
chongkai opened this issue Jun 11, 2020 · 11 comments · Fixed by #28
Closed

Entity throws error after 30 seconds of inactivity #27

chongkai opened this issue Jun 11, 2020 · 11 comments · Fixed by #28
Assignees
Labels
bug Something isn't working

Comments

@chongkai
Copy link

Hi,

I'm playing around with the shopping-cart example. After I send a command with a new user-id, the entity is created. If I keep sending commands to it, everything works fine. But if I stop for 30 seconds and send another command, I see a 500 Internal server error returned from sidecar. In the sidecar logs, I see the following:

2020-06-10 06:02:04.370 ERROR io.cloudstate.proxy.Serve$ - User Function responded with a failure: Incorrect command id in reply, expecting 1 but got 0
2020-06-10 06:02:04.374 ERROR akka.actor.ActorSystemImpl - Internal server error, sending 500 response
java.util.NoSuchElementException: head of empty stream
at akka.stream.scaladsl.Sink$.$anonfun$head$3(Sink.scala:198)
at scala.Option.getOrElse(Option.scala:189)
at akka.stream.scaladsl.Sink$.$anonfun$head$2(Sink.scala:198)
at scala.util.Success.$anonfun$map$1(Try.scala:255)
at scala.util.Success.map(Try.scala:213)
at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:56)
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:74)
at akka.dispatch.internal.SameThreadExecutionContext$$anon$1.unbatchedExecute(SameThreadExecutionContext.scala:21)
at akka.dispatch.BatchingExecutor.execute(BatchingExecutor.scala:123)
at akka.dispatch.BatchingExecutor.execute$(BatchingExecutor.scala:117)
at akka.dispatch.internal.SameThreadExecutionContext$$anon$1.execute(SameThreadExecutionContext.scala:20)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72)
at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288)
at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288)
at scala.concurrent.Promise.trySuccess(Promise.scala:94)
at scala.concurrent.Promise.trySuccess$(Promise.scala:94)
at scala.concurrent.impl.Promise$DefaultPromise.trySuccess(Promise.scala:187)
at akka.stream.impl.HeadOptionStage$$anon$3.onUpstreamFinish(Sinks.scala:236)
at akka.stream.impl.fusing.GraphInterpreter.processEvent(GraphInterpreter.scala:523)
at akka.stream.impl.fusing.GraphInterpreter.execute(GraphInterpreter.scala:390)
at akka.stream.impl.fusing.GraphInterpreterShell.runBatch(ActorGraphInterpreter.scala:625)
at akka.stream.impl.fusing.GraphInterpreterShell$AsyncInput.execute(ActorGraphInterpreter.scala:502)
at akka.stream.impl.fusing.GraphInterpreterShell.processEvent(ActorGraphInterpreter.scala:600)
at akka.stream.impl.fusing.ActorGraphInterpreter.akka$stream$impl$fusing$ActorGraphInterpreter$$processEvent(ActorGraphInterpreter.scala:769)
at akka.stream.impl.fusing.ActorGraphInterpreter.akka$stream$impl$fusing$ActorGraphInterpreter$$shortCircuitBatch(ActorGraphInterpreter.scala:759)
at akka.stream.impl.fusing.ActorGraphInterpreter$$anonfun$receive$1.applyOrElse(ActorGraphInterpreter.scala:785)
at akka.actor.Actor.aroundReceive(Actor.scala:535)
at akka.actor.Actor.aroundReceive$(Actor.scala:533)
at akka.stream.impl.fusing.ActorGraphInterpreter.aroundReceive(ActorGraphInterpreter.scala:691)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:577)
at akka.actor.ActorCell.invoke(ActorCell.scala:547)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
at akka.dispatch.Mailbox.run(Mailbox.scala:231)
at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)

@marcellanz
Copy link
Contributor

Hi @chongkai, thank you for your report on this issue.

Can you confirm a few details to get context?

  • Do you use the unmodified shopping cart example of the project here for this?
  • How do you deploy the user function?

@marcellanz marcellanz self-assigned this Jun 11, 2020
@chongkai
Copy link
Author

Hi @marcellanz,

I followed the go-support reference document https://cloudstate.io/docs/go/current/ to build this app, so it's not exactly the same as the tck code.

Here is my repo, https://github.com/chongkai/shopping-cart-go. I use the Dockerfile to build an image, and use the deploy.yml file to deploy to a local k8s. I have installed the cloudstate operator before that.

Thanks for looking into it.
Chongkai

@marcellanz
Copy link
Contributor

@chongkai thanks for the details! I'll check what happens here.

@marcellanz
Copy link
Contributor

@chongkai I'll have to mention that we have some refactoring and additions since v0.1.0 currently ongoing with additional CRDT support on a feature branch. That should be available soon. But what you have reported should not happen either I think.

@marcellanz
Copy link
Contributor

marcellanz commented Jun 13, 2020

Thanks again @chongkai for your report :)
I've written below what happens and hope the fix will work on your side too.

What happens

Whenever a gRPC stream is cancelled, v0.1.0 of the Go Language Support Library does not remove the entity context and a subsequent init message + a snapshot for the same entity will result in an error.

This issue is triggered by the entity passivation which is set to 30 seconds by cloudstate.proxy.passivation-timeout and after the last message was sent to the proxy, it will reset the stream (http2>RST_STREAM) and the user support grpc context and its stream gets cancelled.

The reported error Incorrect command id in reply, expecting 1 but got 0 roots in an another bug, where some reported failures do not correlate with the command received, in this case the unexpected existence of a entity context, because it was not removed on stream cancellation. This will be fixed in the next release. The failure message with a EventSourcedStreamOut.message.Failure.command_id has the default value, 0 in this case, and a failure reported without a command_id gets reported with 0.

The script issue27.sh below reproduces the issue and validates the fix.

What should happen

cloudstate.EventSourcedServer.Handle should remove any entity context whenever a stream is cancelled, for whatever reason.

PR#28 should fix this with a minimal but not optimal change. The coming milestone – v0.2.0 of the Go Language support will handle the stream differently and this bug will be fixed properly. There is also a race on the context map we'll fix in the next milestone too.

The verification script below can be used without any dependencies other than docker. The fix is published as 0.1.1 for the TCK docker image already.

> ./issue27.sh 0.1.0
Error invoking method "com.example.shoppingcart.ShoppingCart/AddItem": grpc call for "com.example.shoppingcart.ShoppingCart.AddItem" failed: EOF
> ./issue27.sh 0.1.1
{
  "items": [
    {
      "productId": "e-bike",
      "quantity": 15
    }
  ]
}

issue27.sh

#!/usr/bin/env bash

readonly version=${1:-0.1.0}
finally() {
  docker rm -f issue27_proxy
  docker rm -f issue27_tck
}
trap finally EXIT
set -x
#
# run the shopping cart
docker run --name issue27_tck -d -p 8080:8080 cloudstateio/cloudstate-go-tck:${version}
#
# configure entity passivation to be 3 seconds
docker run --name issue27_proxy -d -e USER_FUNCTION_HOST=host.docker.internal -e USER_FUNCTION_PORT=8080 -p 9000:9000 cloudstateio/cloudstate-proxy-dev-mode -Dcloudstate.proxy.passivation-timeout=3s
sleep 5
docker run -it --rm --net=host fullstorydev/grpcurl -max-time 30 -plaintext localhost:9000 list #describe com.example.shoppingcart.ShoppingCart.GetCart
#
# add 7 e-bikes
docker run -it --rm --net=host fullstorydev/grpcurl -plaintext -d '{"user_id":"mrcl", "product_id": "e-bike", "quantity": 7}' localhost:9000 com.example.shoppingcart.ShoppingCart/AddItem
#
# wait to get the entity be passivated by the proxy
sleep 5
docker logs issue27_proxy
#
# add another 8 e-bikes
docker run -it --rm --net=host fullstorydev/grpcurl -plaintext -d '{"user_id":"mrcl", "product_id": "e-bike", "quantity": 8}' localhost:9000 com.example.shoppingcart.ShoppingCart/AddItem
docker logs issue27_proxy
#
# get the shoppping cart
docker run -it --rm --net=host fullstorydev/grpcurl -plaintext -d '{"user_id":"mrcl"}' localhost:9000 com.example.shoppingcart.ShoppingCart/GetCart

On the proxy passivation, the proxy logs this INFO:

2020-06-13 16:45:30.830 INFO akka.actor.LocalActorRef - Message [akka.actor.Status$Failure] to Actor[akka://cloudstate-proxy/system/sharding/ShoppingCart/50/mrcl#81939500] was not delivered. [1] dead letters encountered. If this is not an expected behavior then Actor[akka://cloudstate-proxy/system/sharding/ShoppingCart/50/mrcl#81939500] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.

@chongkai
Copy link
Author

Hi @marcellanz, Thanks for the quick fix! Please keep up the good work.

I'm also looking at using cloudstate in production, any idea how long it would be before its' possible?

@marcellanz
Copy link
Contributor

thanks @chongkai. I quote @jroper for the same question where he stated recently on the mailinglist about production readiness:

It's hard to say - it's still under very active development right now. From Lightbend's perspective, we want to get real user feedback before saying its production ready, for this reason we've been working on a hosted version of it (afterall, it's not really serverless if you have to host it yourself) so that we can get people on it and using it, and that will help us to understand use cases, and whether the current feature set makes sense, what's missing, where we should prioritise our efforts.

As you can see we are on low development versions with different user support libraries and we need some time to get them be used. Your bug report clearly was very helpful and we're looking forward to any feedback we can get. Just let us know if we can help or if you have any other questions here, on Gitter or our Mailinglist.

@jroper
Copy link
Member

jroper commented Jun 15, 2020

@marcellanz If I understand your analysis correctly, the go support has a global map somewhere of entity ids (and service names?) that it uses for the context, is that right? I would recommend against this - in future, we may add features to Cloudstate where we start up additional versions of the same entity, for example, we might offer some built in temporal loading support, where users can say "run this command against this entity in the state that it was in at this point of time in history". To do that, we would create a new entity context for that, even if an existing one already existed on that node, and replay the journal up until that time before running the command.

In the JavaScript and Java support, the entity context is constructed when the gRPC stream is created, and then closed over by the callbacks that are then attached to that gRPC stream.

@marcellanz
Copy link
Contributor

@jroper That's right, yes. It will be changed with the next milestone release, where a runner holds the context for an entity for the whole time its stream is alive.

This is already implemented this way in the CRDT feature branch where I started to "run" entities this way and I will adapt that to eventsourcing too.

marcellanz added a commit that referenced this issue Jun 15, 2020
fixes #27 – remove entity context for a cancelled stream.
@marcellanz
Copy link
Contributor

@chongkai we released v0.1.1 a few minutes ago:
https://pkg.go.dev/github.com/cloudstateio/[email protected]/cloudstate
you should be able to go get -u it.

I verified your repo to be working with this release and I hope it does for you too.

@chongkai
Copy link
Author

@marcellanz, yes it works on my side now. Thanks for the quick turnaround!

marcellanz referenced this issue in marcellanz/go-support Jun 20, 2020
…rom now on. fix of go-support #27. EntityInstanceContexts are gone.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants