[Bug][Priority] RC sometimes not sending zero timer payloads? #1158

Closed
ear-dev opened this issue Apr 27, 2022 · 8 comments · Fixed by #1251

ear-dev commented Apr 27, 2022

I noticed that some sessions on the Brazil bot were at times getting no response from the DF.agent.

https://ba.chatbot.vega.viasat.com/live/sBkcCy9aeq3ujxdpq/room-info
https://ba.chatbot.vega.viasat.com/live/QBdfwBdq657aza5mC/room-info
https://ba.chatbot.vega.viasat.com/live/gtP8QotNBXLEf746e/room-info

Molly suggested that DF was expecting a zero timer payload which never arrived from RC.

hey, so i am digging into the first session you sent, and https://ba.chatbot.vega.viasat.com/live/sBkcCy9aeq3ujxdpq is getting stuck on a particular page in CX that expects an RC zero timer event - but RC didn’t send the event
not sure why as that flow has been working fine for me every time i’ve tested
is there any way to see why RC is not sending that event? it seems like a weird intermittent issue?

@Shailesh351 I will assign this to you. Maybe there is something in the RC logs?

ear-dev commented Jun 16, 2022

NOTE: Molly will provide Shailesh access to the CX logs for debugging. There are a few rooms with this issue pasted above, but I also use this one: Fnz8PEtSYcMR5Zc6Y

@Shailesh351 already has access to the CX agent, so he can look at how the boleto intent and boletoPause events are handled. Other zero timer payloads, such as registration, never show this behavior, in which the room gets stuck because CX does not receive the pause event.

One diff might be that the other payloads also include a "please wait" text message, but we currently do not think that RC handles that any differently.

Another possibility might be that RC is somehow not handling an HTTP error message in this case? Basically from the RC side, we are not seeing any errors showing up in our logs around this event.

ear-dev changed the title from "[Bug] RC sometimes not sending zero timer payloads?" to "[Bug][Priority] RC sometimes not sending zero timer payloads?" on Jun 23, 2022

ear-dev commented Jun 23, 2022

@Shailesh351 will find successful sessions where the visitor sent 'boleto' text, and compare them with our frozen sessions. We may need log points to figure this out.

Molly can provide a list of sessions with the issue. NOTE: Debug logging was on June 21-22, so the sessions have to be from those dates.

ear-dev commented Jul 1, 2022

NOTE: After our recent upgrade in prod to RC version v4.4.2.widechat-4 and DF version v1.2.3.widechat-6, this issue seems to have fixed itself. We will leave this story under review for another week and close it if we do not see a repro.

ear-dev closed this as completed Jul 5, 2022
ear-dev reopened this Jul 6, 2022

ear-dev commented Jul 6, 2022

@Shailesh351 @bhardwajaditya looks like this issue may have returned in prod. This time it was 'verificationPause', where CX was waiting for the event. The event is eventually showing up, but 40 minutes late or so. Somehow our task scheduling is getting stuck?

ear-dev commented Jul 7, 2022

General Notes:

  • What would cause the scheduled event to hang? Which process is responsible for sending it?
  • Load?
  • Stuck threads?
  • Are events best effort or guaranteed? How are they handled?
  • Retry, timing?
  • Latency on events? Characterize: min, max, average.
  • How do events scale? Do they back up?
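
The latency question above could be answered with a small script once scheduled and executed timestamps are pulled from the logs. This is only a sketch: the function name `latency_stats`, the input shape (pairs of scheduled and executed times), and the sample data are assumptions for illustration, not taken from RC or the DF app.

```python
from datetime import datetime
from statistics import mean


def latency_stats(events):
    """Summarize scheduling delay in seconds.

    `events` is an iterable of (scheduled_for, executed_at) datetime pairs.
    Events that never executed (executed_at is None) are counted separately,
    because they have no measurable delay.
    """
    events = list(events)
    delays = [
        (executed - scheduled).total_seconds()
        for scheduled, executed in events
        if executed is not None
    ]
    if not delays:
        return None
    return {
        "min": min(delays),
        "max": max(delays),
        "avg": mean(delays),
        "executed": len(delays),
        "never_executed": len(events) - len(delays),
    }


# Hypothetical sample data, not real session timestamps.
sample = [
    (datetime(2022, 7, 6, 12, 0, 0), datetime(2022, 7, 6, 12, 0, 1)),   # on time
    (datetime(2022, 7, 6, 12, 5, 0), datetime(2022, 7, 6, 12, 45, 0)),  # 40 minutes late
    (datetime(2022, 7, 6, 12, 10, 0), None),                            # never executed
]
stats = latency_stats(sample)
print(stats)
```

Keeping the never-executed count separate avoids hiding the worst failures behind an average.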

Who's responsible for the event scheduling: DF.app -> writes to DB -> appBridge -> RC.server

  • could appBridge be queueing and blocking?
  • could we be failing to do DB writes?
  • Failed scheduled events should throw an error: "The App appID is scheduling an onetime job processor ID ", "The App appId is scheduling a recurring job processor ID"

What is the lifecycle of a scheduled event? Is it covered in the RC docs? @bhardwajaditya can you help me document all the different states that a scheduled event goes through?

ear-dev commented Jul 7, 2022

@bhardwajaditya searching for the log point where a job is getting scheduled does not help because there is no identifying data associated with it. I think we should make a story to add the roomID to these log points. What do you think?
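
To illustrate why the roomID would help, here is a rough sketch of filtering log lines for one room once the ID is part of the message. The `(roomID=...)` suffix, the helper names, and the app ID `df-app` are all hypothetical, and using the pause event names as processor IDs is also an assumption. Only the base message text comes from the log point quoted earlier in this thread.

```python
import re


def schedule_log_line(app_id, processor_id, room_id):
    # Hypothetical format: the existing message text plus a roomID suffix.
    return (
        f"The App {app_id} is scheduling an onetime job "
        f"processor {processor_id} (roomID={room_id})"
    )


# Matches the hypothetical "(roomID=...)" suffix added above.
ROOM_ID = re.compile(r"\(roomID=(?P<room_id>[^)]+)\)")


def lines_for_room(log_lines, room_id):
    """Return only the log lines tagged with the given room ID."""
    matches = []
    for line in log_lines:
        found = ROOM_ID.search(line)
        if found and found.group("room_id") == room_id:
            matches.append(line)
    return matches


logs = [
    schedule_log_line("df-app", "boletoPause", "Fnz8PEtSYcMR5Zc6Y"),
    schedule_log_line("df-app", "verificationPause", "sBkcCy9aeq3ujxdpq"),
    "some unrelated log line",
]
room_logs = lines_for_room(logs, "Fnz8PEtSYcMR5Zc6Y")
print(room_logs)
```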

ear-dev commented Jul 14, 2022

NOTE: we see three flavors of this bug

  • The event never gets executed and our visitors are stuck in a blackout window
  • The event gets triggered late, anywhere from 10 minutes to 2 hours, and CX is totally confused
  • The event gets triggered 10 minutes later and gets executed twice (maybe a few minutes apart even)
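
The three flavors above could be told apart automatically when going through session logs. This is a minimal sketch, assuming we can extract for each scheduled event its intended time and the list of times it actually ran. The function name, the labels, and the 30-second tolerance are assumptions, not anything RC defines.

```python
from datetime import datetime, timedelta


def classify_event(scheduled_for, executions, tolerance=timedelta(seconds=30)):
    """Label how a scheduled event misbehaved.

    `executions` is the list of datetimes at which the event actually ran.
    `tolerance` is an assumed threshold for what still counts as on time.
    """
    if not executions:
        return ["never_executed"]
    labels = []
    if min(executions) - scheduled_for > tolerance:
        labels.append("late")
    if len(executions) > 1:
        labels.append("duplicate")
    return labels or ["ok"]


# Hypothetical timestamps for illustration only.
t0 = datetime(2022, 7, 14, 9, 0, 0)
never = classify_event(t0, [])
late = classify_event(t0, [t0 + timedelta(minutes=10)])
late_twice = classify_event(
    t0, [t0 + timedelta(minutes=10), t0 + timedelta(minutes=13)]
)
on_time = classify_event(t0, [t0 + timedelta(seconds=1)])
print(never, late, late_twice, on_time)
```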

@Shailesh351 can you please look at latest RC server upstream to see if they may have a fix that we're missing? Thanks.

ear-dev commented Jul 26, 2022

@Shailesh351 I've been testing this build and it is currently failing our 'Multiple "continue_blackout" message dropping payloads in a row' test, described in this wiki: https://wiki.viasat.com/pages/viewpage.action?pageId=549170025

Can you verify please? thanks...
