Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Track deletes only in the tombstone map instead of maintaining as copy #27868

Merged
merged 21 commits into from
Feb 19, 2018

Conversation

s1monw
Copy link
Contributor

@s1monw s1monw commented Dec 18, 2017

Today we maintain a copy of every delete in the live version maps. This is unnecessary
and might add quite some overhead if maps grow large. This change moves out the deletes
tracking into the tombstone map only and relies on the cleaning of tombstones when deletes
are collected.

Today we maintain a copy of every delete in the live version maps. This is unnecessary
and might add quite some overhead if maps grow large. This change moves out the deletes
tracking into the tombstone map only and relies on the cleaning of tombstones when deletes
are collected.
@s1monw
Copy link
Contributor Author

s1monw commented Dec 20, 2017

@bleskes I opened #27920 for the last commit but it's needed here.

@colings86 colings86 added v6.3.0 and removed v6.2.0 labels Jan 22, 2018
@s1monw
Copy link
Contributor Author

s1monw commented Jan 23, 2018

@bleskes can you take a look a this if you have time?

Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Production code LGTM (with suggestions that can be rejected if you don't like them). I like the approach. My only worry is that we now don't properly track the memory that will be used by deletes for refresh purposes. Say for example that someone disables refresh (or uses AUTO) and they run a delete by query. The tombstone will grow and we don't see a refresh. WDYT?

I left a question about a potential testing gap.

@@ -54,6 +55,7 @@
// that will prevent concurrent updates to the same document ID and therefore we can rely on the happens-before guanratee of the
// map reference itself.
private boolean unsafe;
private final AtomicLong minDeleteTimestamp = new AtomicLong(Long.MAX_VALUE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment/java doc to what this mean? (minimum timestamp of delete operations that were made while this map was active. this is used to make sure they are kept in the tombstone)

@@ -97,15 +107,20 @@ void markAsUnsafe() {
// have the volatile read of the Maps reference to make it visible even across threads.
boolean needsSafeAccess;
final boolean previousMapsNeededSafeAccess;
/** Tracks bytes used by current map, i.e. what is freed on refresh. For deletes, which are also added to tombstones, we only account
* for the CHM entry here, and account for BytesRef/VersionValue against the tombstones, since refresh would not clear this RAM. */
final AtomicLong ramBytesUsed;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to move it here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an idea - maybe move it to VersionLookup and the VersionLookup class for the tombstones as well? then all accounting is in one place?

// Also enroll the delete into tombstones, and account for its RAM too:
final VersionValue prevTombstone = tombstones.put(uid, version);
// We initially account for BytesRef/VersionValue RAM for a delete against the tombstones, because this RAM will not be freed up
// on refresh. Later, in removeTombstoneUnderLock, if we clear the tombstone entry but the delete remains in current, we shift
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment is wrong now.

map.beforeRefresh();
assertEquals(new VersionValue(1, 1, 1), map.getUnderLock(uid("test")));
assertEquals(new VersionValue(1,1,1), map.getUnderLock(uid("test")));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

space fanatic 👅 ;)

lastDeleteVersionPruneTimeMSec = timeMSec;
}

// testing
void clearDeletedTombstones() {
versionMap.clearTombstones();
versionMap.pruneTombstones(-1, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait - this feels weird - current time is -1 and interval is 0? Shouldn't current time be Long.MAX_VALUE?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

time means we clean everything, I think that is correct?

@@ -137,24 +142,25 @@ public void testConcurrently() throws IOException, InterruptedException {
}
if (isDelete == false && rarely()) {
versionValue = new DeleteVersionValue(versionValue.version + 1, versionValue.seqNo + 1,
versionValue.term, Long.MAX_VALUE);
versionValue.term, clock.incrementAndGet());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+++++

} else {
versionValue = new VersionValue(versionValue.version + 1, versionValue.seqNo + 1, versionValue.term);
}
values.put(bytesRef, versionValue);
map.putUnderLock(bytesRef, versionValue);
}
if (rarely()) {
map.pruneTombstones(0, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you comment we explicitly don't do any pruning? otherwise this looks odd.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure I understand

@@ -137,24 +142,25 @@ public void testConcurrently() throws IOException, InterruptedException {
}
if (isDelete == false && rarely()) {
versionValue = new DeleteVersionValue(versionValue.version + 1, versionValue.seqNo + 1,
versionValue.term, Long.MAX_VALUE);
versionValue.term, clock.incrementAndGet());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can comment on two lines above, so comment here: map.removeTombstoneUnderLock(bytesRef); isn't meant for public usage, right? if we index on a delete, the delete should go away.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not meant for public use

}
do {
Map<BytesRef, VersionValue> valueMap = new HashMap<>(map.getAllCurrent());
final Map<BytesRef, VersionValue> valueMap = new HashMap<>(map.getAllCurrent());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is now less strong as we don't track deletes here any more... we should somehow show that we capture deletes correctly in the various phases of refresh and that we don't forget them until we prune. Please me correct if I'm missing something. This is tricky ;)

@s1monw
Copy link
Contributor Author

s1monw commented Feb 13, 2018

My only worry is that we now don't properly track the memory that will be used by deletes for refresh purposes. Say for example that someone disables refresh (or uses AUTO) and they run a delete by query. The tombstone will grow and we don't see a refresh. WDYT?

I must be missing something, a refresh doesn't clean tombstones they are cleaned over time with the GC deletes. I am not sure I understand why you would want to refresh based on them.

@bleskes
Copy link
Contributor

bleskes commented Feb 13, 2018

must be missing something, a refresh doesn't clean tombstones they are cleaned over time with the GC deletes. I am not sure I understand why you would want to refresh based on them.

We prune deletes are each delete and on refresh. If you have a burst of deletes, we can't prune them when they're done and we have to wait for the refresh cycles to clean them up. We also currently don't refresh every 1s but wait for the memory signature of the shard to grow - as reported by InternalEngine#getIndexBufferRAMBytesUsed , including that version map size). That one now will ignore the delete signature. The more I look at this the more I think we should come up with some other mechanism to prune deletes. This is too implicit imo.

@s1monw
Copy link
Contributor Author

s1monw commented Feb 13, 2018

We prune deletes are each delete and on refresh. If you have a burst of deletes, we can't prune them when they're done and we have to wait for the refresh cycles to clean them up. We also currently don't refresh every 1s but wait for the memory signature of the shard to grow - as reported by InternalEngine#getIndexBufferRAMBytesUsed , including that version map size). That one now will ignore the delete signature. The more I look at this the more I think we should come up with some other mechanism to prune deletes. This is too implicit imo.

we do call maybePruneDeletedTombstones on every delete I think that is enough? we can still call it on a regular basis but I am not sure that is needed.

@bleskes
Copy link
Contributor

bleskes commented Feb 13, 2018

we do call maybePruneDeletedTombstones on every delete I think that is enough?

I tried to explain it with:

If you have a burst of deletes, we can't prune them when they're done and we have to wait for the refresh cycles to clean them up.

We can't clean them up because of gc deletes. We stick to them for a minute (by default).

@s1monw
Copy link
Contributor Author

s1monw commented Feb 13, 2018

We can't clean them up because of gc deletes. We stick to them for a minute (by default).

but we only keep the ones 1/4 of the GC deletes interval. I am not sure if we are chasing corner cases here and if we don't get any changes (which would trigger no refresh) that eventually results in a flush will be enough.

@bleskes
Copy link
Contributor

bleskes commented Feb 13, 2018

I think there's some confusion. We check for gc pruning every 1/4 of the gc interval:

        if (engineConfig.isEnableGcDeletes() && engineConfig.getThreadPool().relativeTimeInMillis() - lastDeleteVersionPruneTimeMSec > getGcDeletesInMillis() * 0.25) {
            pruneDeletedTombstones();
        }

but the actual pruning keeps delete for at least the gc deletes interval:

                if (timeMSec - versionValue.time > getGcDeletesInMillis()) {
                        versionMap.removeTombstoneUnderLock(uid);
                    }
                }

@s1monw
Copy link
Contributor Author

s1monw commented Feb 13, 2018

I do understand what that means but what different does it make. If we can't collect we can't collect / prune. I don't see the issue compared to what we do today

@clintongormley clintongormley added :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Engine :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. labels Feb 13, 2018
@bleskes
Copy link
Contributor

bleskes commented Feb 14, 2018

I don't see the issue compared to what we do today

I think this getting to be a bigger discussion that what I meant. There is a difference, but we can choose to live with it and/or fix in a follow up (hence my comment that the production code is LGTM). I'll reach out to discuss via another channel.

@s1monw
Copy link
Contributor Author

s1monw commented Feb 16, 2018

@bleskes i pushed changes to address the pending issues

@s1monw s1monw requested a review from bleskes February 16, 2018 13:51
Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @s1monw

@s1monw s1monw merged commit 56edb5e into elastic:master Feb 19, 2018
@s1monw s1monw deleted the track_deletes_only_in_tombstones branch February 19, 2018 11:25
s1monw added a commit that referenced this pull request Feb 19, 2018
#27868)

Today we maintain a copy of every delete in the live version maps. This is unnecessary
and might add quite some overhead if maps grow large. This change moves out the deletes
tracking into the tombstone map only and relies on the cleaning of tombstones when deletes
are collected.
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Feb 19, 2018
* master:
  Enable selecting adaptive selection stats
  Remove leftover mention of file-based scripts
  Fix threading issue on listener notification (elastic#28730)
  Revisit deletion policy after release the last snapshot (elastic#28627)
  Remove unused method
  Track deletes only in the tombstone map instead of maintaining as copy (elastic#27868)
  [Docs] Correct typo in README.textile (elastic#28716)
  Fix AdaptiveSelectionStats serialization bug (elastic#28718)
  TEST: Fix InternalEngine#testAcquireIndexCommit
  Add note on temporary directory for Windows service
  Added coming annotation and breaking changes link to release notes script
  Remove leftover PR link for previously disabled bwc tests
  Separate acquiring safe commit and last commit (elastic#28271)
  Fix BWC issue of the translog last modified age stats
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. >enhancement v6.3.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants