-
A problem with this concept: because its only approach to reinitializing is to wait for the "fixed delay" to elapse and start over, this means every reload event (like a

Now I'm thinking, the event task should be a loop that only terminates when an unexpected exception is thrown. For expected exceptions, we immediately attempt reconnection. For unexpected exceptions, the task terminates, the delay time elapses, and then we try again.

This is actually another pretty clean operating principle: the reconnection delay happens if and only if the reconnection loop terminates with an unexpected exception. And, when it terminates with an exception, that's probably a good time to discard the resume token: if things have gotten so bad that we need to pause and try to start over, we should start fresh. (We could also discard the resume token under other circumstances as appropriate.)

So something like:

    ex.scheduleWithFixedDelay(this::eventLoop);
    ...
    void eventLoop() {
        try {
            while (true) {
                connect();
                try {
                    while (true) {
                        processEvent(cursor.next());
                    }
                } catch (CertainSpecificExceptions) {
                    // log, disconnect, and reconnect immediately
                } finally {
                    disconnect();
                }
            }
        } finally {
            discardResumeToken();
        }
    }

Some care needs to be taken that the outer loop doesn't go into a spin retrying the connection over and over: failures to connect really need to throw exceptions.
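For anyone who wants to poke at the control flow, here is a minimal runnable sketch of the loop above. It is not the real MongoDriver code: `Connection`, `Connector`, `ExpectedDisconnectException` (standing in for "CertainSpecificExceptions"), and the scripted `demo()` are all hypothetical names made up for illustration. One detail it adds: `scheduleWithFixedDelay` suppresses all later runs if the task throws, so the scheduled wrapper catches and logs the unexpected exception instead of letting it escape.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class EventLoopSketch {
    /** Hypothetical stand-in for "CertainSpecificExceptions": worth an immediate reconnect. */
    static class ExpectedDisconnectException extends RuntimeException {
        ExpectedDisconnectException(String message) { super(message); }
    }

    /** Hypothetical connection abstraction; not the MongoDB driver API. */
    interface Connection {
        String nextEvent(); // returns the next event, or throws
        void close();
    }

    /** Hypothetical connector; connect() must throw on failure so the loop cannot spin. */
    interface Connector {
        Connection connect();
    }

    final List<String> log = new ArrayList<>();
    String resumeToken = "some-token";
    private final Connector connector;

    EventLoopSketch(Connector connector) { this.connector = connector; }

    /** Same shape as the pseudocode: only an unexpected exception ends the loop. */
    void eventLoop() {
        try {
            while (true) {
                Connection c = connector.connect();
                try {
                    while (true) {
                        log.add("event:" + c.nextEvent());
                    }
                } catch (ExpectedDisconnectException e) {
                    log.add("reconnect:" + e.getMessage()); // then loop around immediately
                } finally {
                    c.close();
                    log.add("disconnect");
                }
            }
        } finally {
            resumeToken = null; // start fresh after the delay
            log.add("discardResumeToken");
        }
    }

    /** The scheduled task: swallow the exception so the scheduler keeps re-running it. */
    void runOnce() {
        try {
            eventLoop();
        } catch (RuntimeException e) {
            log.add("unexpected:" + e.getMessage());
        }
    }

    void start(ScheduledExecutorService ex, long delaySeconds) {
        ex.scheduleWithFixedDelay(this::runOnce, 0, delaySeconds, TimeUnit.SECONDS);
    }

    /** Scripted run: connection 1 fails in an expected way, connection 2 in an unexpected way. */
    static List<String> demo() {
        Iterator<List<String>> scripts = List.of(
                List.of("a", "!expected"),
                List.of("b", "!unexpected")).iterator();
        Connector connector = () -> {
            Iterator<String> it = scripts.next().iterator();
            return new Connection() {
                @Override public String nextEvent() {
                    String s = it.next();
                    if (s.equals("!expected")) throw new ExpectedDisconnectException("socket reset");
                    if (s.equals("!unexpected")) throw new IllegalStateException("unknown format");
                    return s;
                }
                @Override public void close() { }
            };
        };
        EventLoopSketch sketch = new EventLoopSketch(connector);
        sketch.runOnce(); // called directly here; start() would hand it to the scheduler
        return sketch.log;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

In the scripted run, the expected failure leads straight to a second connection, while the unexpected one disconnects, discards the resume token, and only then returns control to the scheduler for the delay.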
-
I should also mention: there's a (minor) challenge around implementing
-
I like it!
-
I thought I'd make a note of something here too. Today I came to wonder: why do we need a "disconnected state" where the driver uniformly throws an exception for all driver operations? Why can't each of those operations succeed or fail on its own merits? If the database is inaccessible, each submitted update / flush will fail anyway; if it's accessible, then great, the operations can succeed.

I think the risk is in the case where the driver loses confidence in its ability to understand the database state. For example, if it can't determine the right format, or receives a

The disconnected state may not be required for cases of interrupted connectivity, but it may be simpler to handle these cases the same way; after all, what is the value of avoiding a "disconnected state" when you can't reach the database anyway?
-
I think the mental shift of this idea is as follows.

The "v2" design considers a live connection to MongoDB as being the "normal" state, and then attempts various corrective actions when things go wrong. The corrective actions could occur on various threads at various times, and must reach the desired state starting from whatever state they find themselves in. Nested corrections could even occur while another corrective action is still in progress.

The "v3" design looks at the driver lifetime as a sequence of (ideally just one) connect-process-disconnect cycles. The whole cycle is initiated by a single background thread, and so these operations have greatly reduced possibility for race conditions. They can use things like catch blocks and try-with-resources to exit with a clean state, rather than having to clean up during each corrective action. Initialization and orderly shutdown naturally share the same logic with reconnection.

The actual driver operations interact with this background thread in a number of ways:

All in all, I think it's going to be much easier to reach a high degree of confidence in the v3 design. I'm also skipping resume tokens entirely for now. In v2 they made things very difficult to reason about and generally just made things worse, so I didn't use them anyway; in v3, this could lead to a substantial reduction in complexity, and I can try to add resume token support again later if we find a need.
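To illustrate the "exit with a clean state" point, here is a small sketch of one connect-process-disconnect cycle using try-with-resources. The `Resource` class and the names `cursor` and `formatDriver` are placeholders invented for this example, not the actual v3 classes. The point is only that the same cleanup runs whether the cycle ends normally (orderly shutdown) or with an exception (reconnection).

```java
import java.util.ArrayList;
import java.util.List;

public class CycleSketch {
    /** Hypothetical resource standing in for a cursor or a format driver. */
    static final class Resource implements AutoCloseable {
        private final String name;
        private final List<String> trace;

        Resource(String name, List<String> trace) {
            this.name = name;
            this.trace = trace;
            trace.add("open " + name);
        }

        @Override public void close() {
            trace.add("close " + name);
        }
    }

    /** One connect-process-disconnect cycle; cleanup is the same on every exit path. */
    static void runCycle(List<String> trace, Runnable process) {
        try (Resource cursor = new Resource("cursor", trace);
             Resource formatDriver = new Resource("formatDriver", trace)) {
            process.run();
        } // resources are closed here in reverse order, even if process.run() threw
    }

    /** Runs one cycle and records what happened, including any exception that ended it. */
    static List<String> demo(Runnable process) {
        List<String> trace = new ArrayList<>();
        try {
            runCycle(trace, process);
        } catch (RuntimeException e) {
            trace.add("cycle ended: " + e.getMessage());
        }
        return trace;
    }

    public static void main(String[] args) {
        System.out.println(demo(() -> { }));
        System.out.println(demo(() -> { throw new IllegalStateException("boom"); }));
    }
}
```

Because the cleanup lives in one place, there is no separate "corrective action" code path that has to work out which resources are currently open.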
-
I've encountered a problem with v3 that I didn't see on v2. It's documented in this MongoDB support case, though I'm not sure how much of this is visible to the public. This seems to be a showstopper. Until I know more about why this is failing, I can't really proceed with v3. 😞 I'll paste the description here:
-
Alright, I think @zAlbee figured it out: when we read the The usual risk of
-
After a bunch more bug fixing, v3 is now the most reliable MongoDriver. I wrote some automation to re-run the automated tests, and I actually had to disable the v1 and v2 drivers because they were failing more often than v3. I left the tests running in a loop overnight, and in 8 hours they passed 72 times with no failures, which is a new record. My current plan:
-
Alright, it's released! Version 0.0.90 is the first one to use the new MongoDriver. 🎉
-
In the recent "resilient" MongoDriver work, I have been struggling to distill out some principles of operation so I can understand how MongoDriver ought to behave in all situations.

I think I might have just figured it out.
The idea is to have all change event operations in the same background thread. This concept actually includes the loading of the initial state which, though not actually a change event, is motivated by the need to coordinate carefully with the event stream so we process only events that occurred after that initial state.
The pseudocode for the event thread becomes a scheduleWithFixedDelay call with the following procedure:

a. Open change stream cursor, either
   - Using last processed resume token, or
   - Getting a fresh resume token, and the initial event
b. Detect format, open format driver, load initial state
This achieves numerous things all at once, compared to the current design, which does only #2 on the event thread:
I need to think about this some more, but it seems promising.