
Reimplement delete-by-query as a bulk request #7052

Closed
clintongormley opened this issue Jul 28, 2014 · 19 comments

@clintongormley
Contributor

Delete-by-query is problematic, e.g. when deleting a document should trigger some action, such as removing a percolator or removing a parent-child link. It is also executed independently on both the primary and the replica, and can end up deleting different documents on each.

Delete-by-query should be replaced with a document-by-document delete using the bulk API (a minimal sketch follows the issue links below).

Fixes: #6025
Fixes: #1712
Fixes: #5797
Fixes: #3593

Depends on #6914
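
For illustration, a minimal sketch (against the 1.x-era Java client API; the index, type, and ids are placeholders, and `client` is an existing `Client`) of what a document-by-document bulk delete looks like:

```java
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;

// Sketch: replace a single delete-by-query call with explicit
// per-document delete actions in one bulk request.
void bulkDelete(Client client, String index, String type, Iterable<String> ids) {
    BulkRequestBuilder bulk = client.prepareBulk();
    for (String id : ids) {
        bulk.add(client.prepareDelete(index, type, id)); // one delete action per doc
    }
    BulkResponse response = bulk.execute().actionGet();
    if (response.hasFailures()) {
        // each item succeeds or fails independently; inspect response.getItems()
    }
}
```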

@mikemccand
Contributor

We could use IndexWriter's tryDeleteDocument to delete by docID for each doc matching the query? Typically it would be successful, and fast, if the reader used for searching is "recent"; when it fails to do the delete, you'd have to fall back to delete-by-Term.

Alternatively, it's also possible to wrap any Query and "spy on" the communication between the consumer (IndexWriter in this case) and that query, to see all docIDs that were visited; this way we could continue to use IW's more efficient delete-by-Query, but gather up all docIDs that were in fact deleted. But this is a more hairy/evil/complex solution... though I think Solr already has something doing this.
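
A rough sketch of the tryDeleteDocument idea, written against a recent Lucene API (where `tryDeleteDocument` returns -1 on failure; in the 4.x API of the time it returned a boolean). The `"id"` field used for the Term fallback is an assumption:

```java
import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.SimpleCollector;

// Sketch: try the cheap by-docID delete first; fall back to delete-by-Term
// when the searcher's reader has fallen out of sync with the writer.
void deleteMatches(IndexWriter writer, IndexSearcher searcher, Query query) throws IOException {
    searcher.search(query, new SimpleCollector() {
        private LeafReaderContext ctx;

        @Override
        protected void doSetNextReader(LeafReaderContext context) {
            ctx = context;
        }

        @Override
        public void collect(int doc) throws IOException {
            int globalDoc = ctx.docBase + doc; // leaf docID -> top-level docID
            // Succeeds only if this segment is still live in the writer,
            // i.e. the (near-real-time) reader is "recent" enough.
            if (writer.tryDeleteDocument(searcher.getIndexReader(), globalDoc) == -1) {
                // Stale segment: fall back to a Term delete on an assumed "id" field.
                String id = searcher.doc(globalDoc).get("id");
                writer.deleteDocuments(new Term("id", id));
            }
        }

        @Override
        public ScoreMode scoreMode() {
            return ScoreMode.COMPLETE_NO_SCORES;
        }
    });
}
```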

@uschindler
Contributor

Very nice: this would also allow returning the number of deleted documents to the consumer, so we could change the reply to include the document count. I sometimes have the problem that I need the count of deleted documents, which is hairy today: you can first execute the query as a count and then execute the delete, but this fails under high load because the two operations are not atomic.

mikemccand added a commit that referenced this issue Mar 4, 2015
Delete-by-query is incredibly costly because it forces a refresh each
time, so if you are also indexing this can cause massive segment
explosion.

This change throttles delete-by-query when merges can't keep up.  It's
likely not enough (#7052 is the long-term solution) but can only
help.

Closes #9986
@mikemccand
Contributor

On #10067 @kimchy pointed out that we need not wait for the task management API for this issue, i.e. we can just do the scan/scroll + bulk deletes synchronously (in the user's current request)...

@uschindler
Contributor

+1

This also means the (incomplete) deprecation on delete-by-query APIs could be removed: #10082

@TwP

TwP commented Mar 29, 2015

As a user who is quite a big fan of delete-by-query, I'm very much hoping it can be implemented in Elasticsearch itself as a scan/scroll + bulk deletes. Moving the scan/scroll + bulk deletes logic into the client would create a bit of wasted network traffic. But if that is the solution moving forward, then so be it. My vote is to keep the current API in place and change the underlying implementation.

Out of perverse curiosity... with a client-side solution, can the bulk deletes happen in parallel with the scroll/scan operations? As a set of documents is returned by the call to /_search/scroll, can those documents immediately be deleted via bulk delete operations? Or would I need to complete the entire scan and then delete documents?

Will Elasticsearch be unhappy if we start deleting the data that it is currently iterating over?

@uschindler
Contributor

Hi @TwP,
you can do the scan and scroll in parallel. Scan and scroll uses the IndexReader opened at the time of the query and keeps it open. Deletes happening while scrolling will not be visible to this IndexReader until scrolling is done. All other search queries (or other scrolls) running in parallel will, of course, see the deletes. While scrolling you see a consistent view of the index.

I agree, we should keep the current API and implement the scan-scroll-delete behind it (please also keep the Java API, not only the REST API: RequestBuilders, etc.). I am also a big fan of this API. One good thing: delete-by-query could then return the number of deleted documents, so I would be happy if the DeleteByQueryResponse contained a "long getCount()"!
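
Something like the following is what "scan-scroll-delete behind the API" would boil down to, and it can return the deleted count. This is only a sketch against the 1.x-era Java client API; the page size and scroll timeout are arbitrary:

```java
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.search.SearchHit;

// Sketch: scan/scroll gives a point-in-time view; each page is deleted in bulk.
long deleteByQuery(Client client, String index, QueryBuilder query) {
    SearchResponse resp = client.prepareSearch(index)
            .setSearchType(SearchType.SCAN)           // no scoring, iteration only
            .setScroll(TimeValue.timeValueMinutes(1)) // keep the point-in-time reader alive
            .setQuery(query)
            .setSize(500)                             // hits per shard per round
            .execute().actionGet();
    long deleted = 0;
    while (true) {
        resp = client.prepareSearchScroll(resp.getScrollId())
                .setScroll(TimeValue.timeValueMinutes(1))
                .execute().actionGet();
        SearchHit[] hits = resp.getHits().getHits();
        if (hits.length == 0) {
            break; // scroll exhausted
        }
        BulkRequestBuilder bulk = client.prepareBulk();
        for (SearchHit hit : hits) {
            bulk.add(client.prepareDelete(hit.getIndex(), hit.getType(), hit.getId()));
        }
        bulk.execute().actionGet(); // deletes stay invisible to the open scroll
        deleted += hits.length;
    }
    return deleted;
}
```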

@mikemccand
Contributor

you can do the scan and scroll in parallel.

That's right: scan/scroll gives you a point-in-time view of the index. It won't see any changes that happened after that time, as long as you keep the scroll "alive".

I'm very much hoping it can be implemented in Elasticsearch itself as a scan/scroll + bulk deletes.

This is the plan... AbstractClient's deleteByQuery methods will be changed to final methods that do the scan/scroll + bulk delete.

please also keep the Java API, not only the REST API: RequestBuilders, etc.

Yeah this is also now the plan...

delete-by-query could then return the number of deleted documents, so I would be happy if the DeleteByQueryResponse contained a "long getCount()"!

OK I'll add that, and also "int getCount()" to each IndexDeleteByQueryResponse, because you can pass multiple index names to DBQ.

@uschindler
Contributor

This just came to my mind: I already implemented the bulk delete stuff locally (in the Java client, see https://sourceforge.net/p/panfmp/code/647/tree//main/trunk/src/de/pangaea/metadataportal/processor/DocumentProcessor.java?diff=516c2e8d5fcbc9791083b0a3:646 for an example). For the given code, version numbers are not really a problem, but for the deleteByQuery API to be consistent, it should execute the scan-scroll and collect all doc IDs AND version (!!!) numbers. When doing the bulk delete, each of these deletes should use the version number returned by the scan-scroll; otherwise updates/inserts from concurrent indexing could be destroyed.
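
A fragment (not self-contained; it modifies the scan/scroll sketch above) showing the two changes this implies: request versions with the search, and pass each hit's version to its delete, so a concurrently re-indexed document triggers a version conflict instead of being destroyed:

```java
// In the initial scan request, ask for each hit's version:
client.prepareSearch(index)
        .setSearchType(SearchType.SCAN)
        .setScroll(TimeValue.timeValueMinutes(1))
        .setQuery(query)
        .setVersion(true)   // include _version in every hit
        .setSize(500)
        .execute().actionGet();

// ...and in the bulk loop, delete conditionally on that version:
for (SearchHit hit : hits) {
    bulk.add(client.prepareDelete(hit.getIndex(), hit.getType(), hit.getId())
            .setVersion(hit.getVersion())); // version conflict instead of
                                            // deleting a newer document
}
```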

@kimchy
Member

kimchy commented Mar 29, 2015

@uschindler aye, it should be implemented similarly to what we do with a single-document update, for example making sure to take parent and routing into account.

@s1monw
Contributor

s1monw commented Mar 30, 2015

I think we should split the problems into engine level and shard level. The shard level problems (consistency between replica and primary) can be solved in a different PR than the engine level problems.

I think we should take baby steps here and implement the deletion as a simple Lucene search inside the engine, and keep on using the query inside the transaction log etc. Just keep everything as it is, except for iterating the hits inside the engine and deleting them one by one. Replication also stays the same for now. This solves the refresh problem as well as the OOM etc., and we can also utilize IW#tryDeleteDocument to speed things up.

The shard level problem is a bigger one and should maybe be solved by using sequence IDs (delete after) or a lock on the shard level that prevents any other changes to happen concurrently on both replica and primaries.

Solving the issue on the engine level lets us make progress on the right level IMO, and we can fix the distributed-system problems on the layer where they are happening.

@mikemccand
Contributor

I think we should split the problems into engine level and shard level.

+1: if we can really decouple the two (engine impl vs shard consistency), that's wonderful. I can tackle the engine level, to fix the OOME during concurrent indexing (#6025).

@s1monw
Contributor

s1monw commented Mar 31, 2015

@mikemccand to be honest I think we should just stop indexing for the duration of the delete by query. Simple solution though...

@nariman-haghighi

To echo @TwP's sentiment, this is a massively common API that many teams use regularly to delete thousands of documents. Moving this to the client or deprecating it outright would be an utter disaster. A new server-side implementation that preserves the API is the only sensible path, please provide some final guidance on this so we can plan appropriately.

@tlrx removed the "help wanted" and "adoptme" labels Jun 11, 2015
@tlrx
Member

tlrx commented Jun 11, 2015

@nariman-haghighi you may be interested in #11516 and #11584.

@tlrx closed this as completed in ba35406 Jun 17, 2015
@traviscollins

I agree with @TwP and @nariman-haghighi - removing the delete-by-query API is really not serving users well. This is basic functionality that should be implemented on the server side in the core API with a simple client method.

@MattFriedman

I see this issue is now closed. Has it been resolved? Will the native Elastic interface be maintained or removed? I read all the comments and I'm still not sure.

@clintongormley
Contributor Author

@MattFriedman in 2.0 delete-by-query has been implemented using bulk, and moved into a plugin: https://github.com/elastic/elasticsearch/tree/master/plugins/delete-by-query

@pulkitsinghal

pulkitsinghal commented Aug 6, 2016

@clintongormley - I went looking but did not find it... Should I instead rely on the summary I found in the commits, stating:

The Delete-By-Query plugin has been removed in favor of a new Delete By Query API implementation in core. It now supports throttling, retries and cancellation but no longer supports timeouts. Instead use the cancel API to cancel deletes that run too long.

@nik9000
Member

nik9000 commented Aug 6, 2016

It should be mentioned somewhere in the migration docs, but I'm on mobile. Yes, in 5.0 delete-by-query is included in the distribution. It is technically shipped in the module called reindex, meaning it is in a pre-bundled plugin.
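
For reference, the 5.x core implementation is exposed through the reindex module's Java API roughly like this (a sketch based on the 5.x docs; the index name and query are placeholders, and `client` is an existing `Client`):

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.reindex.BulkByScrollResponse;
import org.elasticsearch.index.reindex.DeleteByQueryAction;

// Sketch: delete-by-query in 5.x, now in core via the reindex module.
// It runs as a cancellable task and reports how many docs were deleted.
BulkByScrollResponse response = DeleteByQueryAction.INSTANCE.newRequestBuilder(client)
        .filter(QueryBuilders.termQuery("user", "kimchy")) // docs to delete
        .source("my-index")                                // index to run against
        .get();
long deleted = response.getDeleted(); // the count callers asked for above
```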

