Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ImmutableDB is leaking file handles #1543

Closed
nfrisby opened this issue Jan 30, 2020 · 3 comments · Fixed by #1547
Closed

ImmutableDB is leaking file handles #1543

nfrisby opened this issue Jan 30, 2020 · 3 comments · Fixed by #1547
Assignees
Labels
bug Something isn't working consensus issues related to ouroboros-consensus immutable db
Milestone

Comments

@nfrisby
Copy link
Contributor

nfrisby commented Jan 30, 2020

On master (commit 62d8e66), this test case fails with ImmutableDB is leaking file handles.

          let ncn5 = NumCoreNodes 5 in
          prop_simple_real_pbft_convergence NoEBBs (SecurityParam 2) TestConfig
            { numCoreNodes = ncn5
            -- Still fails if I increase numSlots.
            , numSlots     = NumSlots 54
            , nodeJoinPlan = NodeJoinPlan $ Map.fromList
              [(CoreNodeId 0,SlotNo {unSlotNo = 0})
              ,(CoreNodeId 1,SlotNo {unSlotNo = 0})
              ,(CoreNodeId 2,SlotNo {unSlotNo = 0})
              ,(CoreNodeId 3,SlotNo {unSlotNo = 53}),(CoreNodeId 4,SlotNo {unSlotNo = 53})
              ]
              -- Passes if I drop either of these restarts.
            , nodeRestarts = NodeRestarts (Map.fromList [(SlotNo {unSlotNo = 50},Map.fromList [(CoreNodeId 0,NodeRestart)]),(SlotNo {unSlotNo = 53},Map.fromList [(CoreNodeId 3,NodeRestart)])])
            , nodeTopology = meshNodeTopology ncn5
              -- Slot length of 19s passes, and 21s also fails; I haven't seen this matter before.
            , slotLengths  = singletonSlotLengths (slotLengthFromSec 20)
            , initSeed     = Seed {getSeed = (15062108706768000853,6202101653126031470,15211681930891010376,1718914402782239589,12639712845887620121)}
            }

With some instrumentation, I determined that node c0 leaks one ImmDB file handle, for path @00000.epoch@, read only, offset at 0. It was opened by a ChainSyncServer thread.

The test case seems somewhat fragile, since the 'slotLengths' value seems to matter.

@nfrisby nfrisby added bug Something isn't working consensus issues related to ouroboros-consensus labels Jan 30, 2020
@nfrisby
Copy link
Contributor Author

nfrisby commented Jan 30, 2020

It seems to be fixed by adding uninterruptibleMask_ to the ResourceRegistry.allocateEither of the iterator in Ouroboros.Storage.ChainDB.Impl.ImmDB.registeredStream. But there are several potential MVar and IO operations within that bracket; are we sure we want all of them to uninterruptible?

As an alternative, perhaps we could instead add the file handle to a registry ASAP instead of waiting for it to make it into an iterator which then gets added to a registry?

@nfrisby
Copy link
Contributor Author

nfrisby commented Jan 30, 2020

FYI, @mrBliss I pushed up a branch mrBliss/fix-1543 that just has a commit on it that makes test-consensus only run this repro.

@mrBliss mrBliss added this to the S6 2020-02-13 milestone Jan 31, 2020
@mrBliss mrBliss self-assigned this Jan 31, 2020
mrBliss added a commit that referenced this issue Jan 31, 2020
Fixes #1543.

The ChainDB.Iterators and ChainDB.Readers use ImmutableDB.Iterators
internally. Both take a ResourceRegistry to allocate the ImmutableDB.Iterator
in so that the ImmutableDB.Iterator and the open handle it contains is closed
when an exception occurs.

If an exception occurs at the wrong time while opening an
ImmutableDB.Iterator, after opening the handle, we might leak the handle. Fix
this by directly allocating the handle in the ResourceRegistry instead of the
ImmutableDB.Iterator that refers to the handle.
@mrBliss
Copy link
Contributor

mrBliss commented Jan 31, 2020

FYI, @mrBliss I pushed up a branch mrBliss/fix-1543 that just has a commit on it that makes test-consensus only run this repro.

Thanks for that! (After force-pushing my fix, that commit is now gone.)

iohk-bors bot added a commit that referenced this issue Jan 31, 2020
1547: ImmutableDB.Iterator: allocate handles in the ResourceRegistry r=mrBliss a=mrBliss

Fixes #1543.

The ChainDB.Iterators and ChainDB.Readers use ImmutableDB.Iterators internally. Both take a ResourceRegistry to allocate the ImmutableDB.Iterator in so that the ImmutableDB.Iterator and the open handle it contains is closed when an exception occurs.

If an exception occurs at the wrong time while opening an ImmutableDB.Iterator, after opening the handle, we might leak the handle. Fix this by directly allocating the handle in the ResourceRegistry instead of the ImmutableDB.Iterator that refers to the handle.

Co-authored-by: Thomas Winant <[email protected]>
@iohk-bors iohk-bors bot closed this as completed in 4919c4a Jan 31, 2020
iohk-bors bot added a commit that referenced this issue Mar 2, 2020
1506: Fix cloneBlockchainTime r=nfrisby a=nfrisby

Fixes #1489. Fixes #1524.

Fixing Issue #1489 (let nodes lead when they join) and letting k vary in the
range [2 .. 10] since the dual-ledger tests do that now revealed several
Issues.

Issues in the library, not just the test infrastructure:

  * Issue #1505 -- `removeTxs` cannot use the fast path when validating after
    removing or else we might have dangling tx inputs references. Was fixed
    by #1565.

  * Issue #1511, bullet 1 (closed) -- The `Empty` cases in
    `prevPointAndBlockNo` were wrong. Recent PRs have addressed this: #1544 and
    #1589.

  * Issue #1543 (closed) -- A bracket in `registeredStream` was spoiled by an
    interruptible operation. Thomas' PR re-designed the vulnerability away. I
    think this is unrelated to the other changes; it was lurking and happened
    to pop up here just because I've been running hundreds of thousands of
    tests.

Issues only in the test infrastructure:

  * Issue #1489: let nodes lead when they join. This bug slipped in recently,
    when I added cloning of `BlockchainTime`s as part of the
    restarting/rekeying loop in the test infrastructure.

  * _PBFT reference simulator_. (Was masked by 1489.) Model competing 1-block
    chains in Ref.PBFT and use its results where applicable instead of the
    .Expectations module. Check that the PBFT threadnet and Ref.PBFT.simulate
    results agree on `Ref.Nominal` slots.

    The Ref.PBFT module had been making assumptions that were accurate given
    the guards in RealPBFT generators, given the `k >= n` regime. But outside
    of that regime, when the security parameter exceeds the node count, it
    wasn't enough. Also, it couldn't be compared against the PBFT threadnet
    because of those assumptions.

  * _PBFT reference simulator_. (Cascade of above.) Add
    `definitelyEnoughBlocks` to confirm the "at least k blocks in 2k slots"
    invariant in `genRealPBFTNodeJoinPlan`. The existing guards in the RealPBFT
    generators are intentionally insufficient by themselves; this way we can
    optimize them to avoid O(n^2) complexity without risking divergence from
    the `suchThat`s.

  * _Origin corner-case_. (Was masked by 1489.) Discard DualPBFT tests that
    forge in slot 0. The current `cardano-ledger-specs` doesn't allow for that.
    My hypothesis is that `cvsLastSlot` would need to be able to represent
    origin.

  * _Dlg cert tx_. Adjust `genNodeRekeys` to respect "If node X rekeys in slot
    S and Y leads slot S+1, then either the topology must connect X and Y
    directly, or Y must join before slot S."

  * _Dlg cert tx_. (Was (statistically?) masked by relatively large k.) Add
    `rekeyOracle` and use it to determine which epoch number to record in the
    dlg cert. The correct value depends on which block the dlg cert tx will end
    up in, not when we first add it to our mempool.

  * _Dlg cert tx_. (Was (statistically?) masked by relatively large k.) Add the
    dlg cert tx to the mempool each time the ledger changes.

Co-authored-by: Nicolas Frisby <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working consensus issues related to ouroboros-consensus immutable db
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants