Gaps in mt_events sequence causing subscriptions to be paused #3503
Comments
"Is it intended that the async daemon should gracefully recover from such gaps in the sequence because it seems inevitable in our case that these will happen?"
Yes, but if the app can't connect to Postgres (which, judging by your error messages, is what happened), there's nothing Marten can do. What do you want to happen here? I can't say there are many reports of this behavior. This might be just the second, and in the first case I know there was a lot of other wacky stuff going on too.
"To fix this we had to manually set the HighWaterMark to 31382 and all the subscriptions to 30051 and then restart the async daemon."
Try not to mess with that table if you can help it. I would have thought that restarting the app would have allowed the high water mark to skip over the gaps after a little bit of time.
Also, you might look into the "quick append" option for the event store to avoid having so many gaps in the first place. That might also help.
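To picture the "skip over the gaps after a little bit of time" behavior mentioned above, here is a toy model in Python. It is only a sketch under stated assumptions and not Marten's implementation: the `HighWaterTracker` class, its `stale_after_seconds` threshold, and the `advance` method are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class HighWaterTracker:
    """Toy model of a high water mark (hypothetical, not Marten's code)."""

    mark: int
    stale_after_seconds: float  # assumed threshold before a gap is skipped
    stuck_since: Optional[float] = None

    def _walk_contiguous(self, committed: Set[int]) -> bool:
        # Move forward while the next sequence id is present.
        moved = False
        while self.mark + 1 in committed:
            self.mark += 1
            moved = True
        return moved

    def advance(self, committed: Set[int], now: float) -> int:
        if self._walk_contiguous(committed):
            self.stuck_since = None

        higher = [seq for seq in committed if seq > self.mark]
        if not higher:
            # Nothing beyond the mark, so there is no gap to wait on.
            self.stuck_since = None
            return self.mark

        if self.stuck_since is None:
            # First time this gap is seen: start the clock.
            self.stuck_since = now
        elif now - self.stuck_since >= self.stale_after_seconds:
            # The gap has been stale long enough: jump past it.
            self.mark = min(higher)
            self.stuck_since = None
            self._walk_contiguous(committed)
        return self.mark


# Sample data shaped like the report: ids stop at 30041 and resume at 30051.
committed = set(range(30030, 30042)) | set(range(30051, 30061))
tracker = HighWaterTracker(mark=30030, stale_after_seconds=3.0)
first = tracker.advance(committed, now=0.0)   # stops at the gap
second = tracker.advance(committed, now=1.0)  # still waiting
third = tracker.advance(committed, now=3.0)   # gap is stale, skip it
```

In this model the mark waits at the gap until the threshold elapses and then jumps to the next committed id. The behavior reported in this issue corresponds to the detection loop no longer being run at all, so the jump never occurs.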
Also, did Marten keep logging connectivity errors, or does it appear that the high water mark detection just stopped in this case (which it most certainly should never do)?
The problem is that even after the connection to postgres has been restored (the maintenance lasts for 1-2 minutes typically), the async daemon does not detect the new high water mark until restarted.
It just seems to stop. There are no log lines after the 06:02:38.174 one posted above. I just tested locally by stopping postgres running in docker and issuing an
And the original one from my issue, which is running in AKS
As you can see the difference is
We are using quick append already. The sequence gap seems to be caused solely by the maintenance (is this a thing?) and not by failed transactions, because there were no attempted event writes around this time.
"The sequence gap seems to be caused solely by the maintenance (is this a thing?)" -- yeah, I can totally see that being an issue if you shut things down hard without doing a graceful stop. We've had reports of
Yes, I was able to do that locally, and the async daemon picked up from the sequence id after the gap once Postgres was back up. I only tried to reproduce it once; it worked as expected and continued processing subscriptions after the gap. I don't know whether this needs specific reproduction steps, maybe specific to Azure, but I'm still trying to reproduce it in a controlled way.
We experienced a problem this morning where unavoidable maintenance on our Azure Postgres database caused a gap in seq_id values in the mt_events table. As a result, the high water mark never made it above the gap and subscriptions were paused.
As the image shows there is a gap between 30041 and 30051.
The logs around this time show a connection issue to the Postgres database between 06:01:38 and 06:02:36, followed by some indication that a gap was detected. However, when we checked mt_event_progression 6 hours later, the HighWaterMark and all of the subscriptions were still stuck at 30041. They had not proceeded from 30051 up to the latest sequence id, 31382, as expected.
To fix this we had to manually set the HighWaterMark to 31382 and all the subscriptions to 30051 and then restart the async daemon. Is it intended that the async daemon should gracefully recover from such gaps in the sequence because it seems inevitable in our case that these will happen?
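As a small illustration of what "a gap between 30041 and 30051" means in terms of sequence ids, the following Python sketch finds gaps in a list of ids. The `find_gaps` helper and the sample data are hypothetical and are not taken from Marten or from the actual table contents; the sample is only shaped like the numbers reported above.

```python
from typing import Iterable, List, Tuple


def find_gaps(seq_ids: Iterable[int]) -> List[Tuple[int, int]]:
    """Return (last_id_before_gap, first_id_after_gap) pairs."""
    ordered = sorted(set(seq_ids))
    gaps = []
    # Compare each id with its successor; a difference above 1 is a gap.
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > 1:
            gaps.append((previous, current))
    return gaps


# Hypothetical sample: ids run up to 30041, then resume at 30051.
sample = list(range(30035, 30042)) + list(range(30051, 30056))
gaps = find_gaps(sample)
print(gaps)  # [(30041, 30051)]
```

The same check could be run against the real seq_id values exported from mt_events to confirm where the sequence jumps.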