Skip to content
This repository has been archived by the owner on Jun 1, 2021. It is now read-only.

safer tracking of issued leveldb iterators #231

Conversation

jrudolph
Copy link

Previously, creating an iterator and tracking wasn't an atomic operation
resulting in iterator operations being run after the db had been closed.
This resulted in all kinds of crashes because of illegal native state
during the shutdown procedure.

/cc @volkerstampa

@jrudolph
Copy link
Author

(I will update this after your module restructuring but it's pretty self-contained so this shouldn't be a big problem)

@krasserm
Copy link
Contributor

LGTM, many thanks for fixing! Do you also have a running test for reproducing the crashes, for example, one that shuts down the log actors while several readers are trying to read/replay events?

Also, please use the wip-<issue-nr>-<descritpion> or fix-<issue-nr>-<descritpion> pattern for branch names. No need to change for this PR though 😃

@jrudolph
Copy link
Author

I'll try to write a test that exposes the behavior and fix the suggested issues in the next days.

@krasserm
Copy link
Contributor

Great, thanks! I'll also have #169 ready soon.

@krasserm
Copy link
Contributor

@jrudolph please sign the CLA so that I can merge your contribution. Btw, #169 is already merged to master.

@jrudolph
Copy link
Author

Cool, I'm now at writing a test and merging to the new project structure.

@jrudolph
Copy link
Author

I updated this PR to the current master and addressed your suggestions. I tried writing a test but didn't managed to write one that would evoke the crash without having to make changes to the actor code itself. I guess the reason is that the crash relies heavily on the scheduling of threads so that the previous Thread.sleep(500) in postStop ends at the exact right moment to catch the EventIterator code in the right moment.

Even with a test for this issue in place it wouldn't make sure that the new code works correctly. A better long term solution would be putting the whole native LevelDB API behind another wrapper that takes care for these concerns and which can be tested in isolation.

Regarding the CLA, I sent a mail to [email protected] and will report back once I've signed it.

@jrudolph
Copy link
Author

Signed the CLA now.

@krasserm
Copy link
Contributor

Thanks. Can you please investigate the build failures? https://travis-ci.org/RBMHTechnology/eventuate/builds/115815784#L1167

Previously, creating an iterator and tracking wasn't an atomic operation
resulting in iterator operations being run after the db had been closed.
This resulted in all kinds of crashes because of illegal native state
during the shutdown procedure.
@jrudolph jrudolph force-pushed the w/fix-leveldb-crashes-during-shutdown branch from e80cac3 to 8b0fd58 Compare March 17, 2016 15:58
@jrudolph
Copy link
Author

Looks like I made a last-minute config name change... Let's wait what the next build says.

@jrudolph
Copy link
Author

Now all checks have passed.

@krasserm
Copy link
Contributor

In this commit I experimented with an alternative solution proposed by @volkerstampa. Here, usage of iterators is done in child actors of the parent LeveldbEventLog actor. When the parent actor stops, Akka will ensure that the child actors (using the iterators) will be stopped first, hence there won't be active iterators the the parent actors stops LevelDB. This approach is IMO less complex than using explicit iterator management with a global reference counter. WDYT?

@jrudolph
Copy link
Author

Yes, that sounds like a nice solution as well. Maybe it could be made a bit more explicit in the code why it is done like this. The flow of data is a bit subtle to understand because the new child actor closes over the state of the parent actor which makes it somewhat hard to see in which way the invariants are enforced.

@krasserm
Copy link
Contributor

@jrudolph can you make a proposal how to improve on that?

@volkerstampa
Copy link
Contributor

@krasserm I just tried the child-actor based approach with my local setup and can confirm that it seems to fix the problems of crashing tests.

@jrudolph The child-actor closes in the same way over the state of the parent actor as the future did before, right? So, at least to me this does not make things harder to understand than before!?!

@jrudolph
Copy link
Author

So, at least to me this does not make things harder to understand than before!?!

Yes, I agree. It's a pragmatic solution that requires minimal changes, it's certainly not worse than this PR. From my side it would probably be enough to add a comment to postStop why this is now safe.

So, if there's something left to be desired it's the same as with the current state of this PR. It's more about code organization. It would be nice if there was a single component which alone is responsible for making the LevelDB access crash-safe. In the best case, leveldbjni would ensure crash-safety itself (IMO it's a reasonable expectation for a Java lib not to crash the JVM in any cases...).

@krasserm krasserm added bug and removed enhancement labels Mar 19, 2016
@krasserm krasserm removed this from the 0.7 milestone Mar 19, 2016
@krasserm
Copy link
Contributor

@volkerstampa @jrudolph thanks for feedback. I just created #239 (superseding this PR) with an additional comment in postStop() how the issue has been fixed. Please review, thanks!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants