Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Clean left behind model state docs #30659

Conversation

dimitris-athanasiou
Copy link
Contributor

@dimitris-athanasiou dimitris-athanasiou commented May 16, 2018

It is possible for model state documents to be
left behind in the state index. This may be
because of bugs or uncontrollable scenarios.
In any case, those documents may take up quite
some disk space when they add up. This commit
adds a step in the expired data deletion that
is part of the daily maintenance service. The
new step searches for state documents that
do not belong to any of the current jobs and
deletes them.

Closes #30551

@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like it will fix the known problem.

I just left a suggestion on how we could improve encapsulation and extend it to any type of orphaned state document.

while (stateDocIdsIterator.hasNext()) {
Deque<String> stateDocIds = stateDocIdsIterator.next();
for (String stateDocId : stateDocIds) {
int modelStateSuffixIndex = stateDocId.indexOf("_model_state_");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to encapsulate this in the ModelState class - say have a method jobIdFromDocId or something like that.

for (String stateDocId : stateDocIds) {
int modelStateSuffixIndex = stateDocId.indexOf("_model_state_");
if (modelStateSuffixIndex < 0) {
// e.g. quantiles, etc.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically we could also get orphaned quantiles and categorizer state documents (due to bugs we don't know about yet or people meddling with the contents of other indices).

So it would be good to add jobIdFromDocId methods to the Quantiles and CategorizerState classes too. The pattern could be they return null if they don't understand the format. Then only continue if none of the classes we store in the state index can find the job ID from the document ID.

}

private void executeDeleteUnusedStateDocs(BulkRequestBuilder deleteUnusedStateRequestBuilder, ActionListener<Boolean> listener) {
LOGGER.info("Found {} unused model state documents; attempting to delete",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By convention the {} should be in square brackets.

Also, if we delete all types of orphaned documents then model state -> just state (and in the other message below too).

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

As one final robustness improvement, you could use lastIndexOf() instead of indexOf() in the 3 new extractJobId methods. It would make a difference if somebody named their job detect_anomalies_in_my_quantiles or something like that. But maybe I'm being too paranoid!

@dimitris-athanasiou
Copy link
Contributor Author

@droberts195 No, definitely a good idea. Paranoid is good!

It is possible for model state documents to be
left behind in the state index. This may be
because of bugs or uncontrollable scenarios.
In any case, those documents may take up quite
some disk space when they add up. This commit
adds a step in the expired data deletion that
is part of the daily maintenance service. The
new step searches for state documents that
do not belong to any of the current jobs and
deletes them.

Closes elastic#30551
@dimitris-athanasiou dimitris-athanasiou force-pushed the delete-unused-model-state-docs branch from 245d0bc to 84f4923 Compare May 17, 2018 11:00
@dimitris-athanasiou dimitris-athanasiou merged commit 75665a2 into elastic:master May 17, 2018
@dimitris-athanasiou dimitris-athanasiou deleted the delete-unused-model-state-docs branch May 17, 2018 14:51
dimitris-athanasiou added a commit that referenced this pull request May 17, 2018
It is possible for state documents to be
left behind in the state index. This may be
because of bugs or uncontrollable scenarios.
In any case, those documents may take up quite
some disk space when they add up. This commit
adds a step in the expired data deletion that
is part of the daily maintenance service. The
new step searches for state documents that
do not belong to any of the current jobs and
deletes them.

Closes #30551
dimitris-athanasiou added a commit that referenced this pull request May 17, 2018
It is possible for state documents to be
left behind in the state index. This may be
because of bugs or uncontrollable scenarios.
In any case, those documents may take up quite
some disk space when they add up. This commit
adds a step in the expired data deletion that
is part of the daily maintenance service. The
new step searches for state documents that
do not belong to any of the current jobs and
deletes them.

Closes #30551
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request May 17, 2018
…ngs-to-true

* elastic/master: (25 commits)
  [DOCS] Replace X-Pack terms with attributes
  [ML] Clean left behind model state docs (elastic#30659)
  Correct typos
  filters agg docs duplicated 'bucket' word removal (elastic#30677)
  top_hits doc example description update (elastic#30676)
  [Docs] Replace InetSocketTransportAddress with TransportAdress (elastic#30673)
  [TEST] Account for increase in ML C++ memory usage (elastic#30675)
  User proper write-once semantics for GCS repository (elastic#30438)
  Remove bogus file accidentally added
  Add detailed assert message to IndexAuditUpgradeIT (elastic#30669)
  Adjust fast forward for token expiration test  (elastic#30668)
  Improve explanation in rescore (elastic#30629)
  Deprecate `nGram` and `edgeNGram` names for ngram filters (elastic#30209)
  Watcher: Fix watch history template for dynamic slack attachments (elastic#30172)
  Fix _cluster/state to always return cluster_uuid (elastic#30656)
  [Tests] Add debug information to CorruptedFileIT
  Preserve REST client auth despite 401 response (elastic#30558)
  [test] packaging: add windows boxes (elastic#30402)
  Make xpack modules instead of a meta plugin (elastic#30589)
  Mute ShrinkIndexIT
  ...
dnhatn added a commit that referenced this pull request May 19, 2018
* 6.x:
  Mute testCorruptFileThenSnapshotAndRestore
  Plugins: Remove meta plugins (#30670)
  Upgrade to Lucene-7.4.0-snapshot-59f2b7aec2 (#30726)
  Docs: Add uptasticsearch to list of clients (#30738)
  [TEST] Reduce forecast overflow to disk test memory limit (#30727)
  [DOCS] Removes redundant index.asciidoc files (#30707)
  [DOCS] Moves X-Pack configurationg pages in table of contents (#30702)
  [ML][TEST] Fix bucket count assertion in ModelPlotsIT (#30717)
  [ML][TEST] Make AutodetectMemoryLimitIT less fragile (#30716)
  [Build] Add test admin when starting gradle run with trial license and
  [ML] provide tmp storage for forecasting and possibly any ml native jobs #30399
  Tests: Fail if test watches could not be triggered (#30392)
  Watcher: Prevent duplicate watch triggering during upgrade (#30643)
  [ML] add version information in case of crash of native ML process (#30674)
  Add detailed assert message to IndexAuditUpgradeIT (#30669)
  Preserve REST client auth despite 401 response (#30558)
  Make TransportClusterStateAction abide to our style (#30697)
  [DOCS] Fixes edit URLs for stack overview (#30583)
  [DOCS] Add missing callout in IndicesClientDocumentationIT
  Backport get settings API changes to 6.x (#30494)
  Silence sleep based watcher test
  [DOCS] Replace X-Pack terms with attributes
  Improve explanation in rescore (#30629)
  [test] packaging: add windows boxes (#30402)
  [ML] Clean left behind model state docs (#30659)
  filters agg docs duplicated 'bucket' word removal (#30677)
  top_hits doc example description update (#30676)
  MovingFunction Pipeline agg backport to 6.x (#30658)
  [Docs] Replace InetSocketTransportAddress with TransportAdress (#30673)
  [TEST] Account for increase in ML C++ memory usage (#30675)
  User proper write-once semantics for GCS repository (#30438)
  Deprecate `nGram` and `edgeNGram` names for ngram filters (#30209)
  Watcher: Fix watch history template for dynamic slack attachments (#30172)
  Fix _cluster/state to always return cluster_uuid (#30656)
dnhatn added a commit that referenced this pull request May 19, 2018
* master:
  Scripting: Remove getDate methods from ScriptDocValues (#30690)
  Upgrade to Lucene-7.4.0-snapshot-59f2b7aec2 (#30726)
  [Docs] Fix single page :docs:check invocation (#30725)
  Docs: Add uptasticsearch to list of clients (#30738)
  [DOCS] Removes out-dated x-pack/docs/en/index.asciidoc
  [DOCS] Removes redundant index.asciidoc files (#30707)
  [TEST] Reduce forecast overflow to disk test memory limit (#30727)
  Plugins: Remove meta plugins (#30670)
  [DOCS] Moves X-Pack configurationg pages in table of contents (#30702)
  TEST: Add engine log to testCorruptFileThenSnapshotAndRestore
  [ML][TEST] Fix bucket count assertion in ModelPlotsIT (#30717)
  [ML][TEST] Make AutodetectMemoryLimitIT less fragile (#30716)
  Default copy settings to true and deprecate on the REST layer (#30598)
  [Build] Add test admin when starting gradle run with trial license and
  This implementation lazily (on 1st forecast request) checks for available diskspace and creates a subfolder for storing data outside of Lucene indexes, but as part of the ES data paths.
  Tests: Fail if test watches could not be triggered (#30392)
  [ML] add version information in case of crash of native ML process (#30674)
  Make TransportClusterStateAction abide to our style (#30697)
  Change required version for Get Settings transport API changes to 6.4.0 (#30706)
  [DOCS] Fixes edit URLs for stack overview (#30583)
  Silence sleep based watcher test
  [TEST] Adjust version skips for movavg/movfn tests
  [DOCS] Replace X-Pack terms with attributes
  [ML] Clean left behind model state docs (#30659)
  Correct typos
  filters agg docs duplicated 'bucket' word removal (#30677)
  top_hits doc example description update (#30676)
  [Docs] Replace InetSocketTransportAddress with TransportAdress (#30673)
  [TEST] Account for increase in ML C++ memory usage (#30675)
  User proper write-once semantics for GCS repository (#30438)
  Remove bogus file accidentally added
  Add detailed assert message to IndexAuditUpgradeIT (#30669)
  Adjust fast forward for token expiration test  (#30668)
  Improve explanation in rescore (#30629)
  Deprecate `nGram` and `edgeNGram` names for ngram filters (#30209)
  Watcher: Fix watch history template for dynamic slack attachments (#30172)
  Fix _cluster/state to always return cluster_uuid (#30656)
  [Tests] Add debug information to CorruptedFileIT

# Conflicts:
#	test/framework/src/main/java/org/elasticsearch/indices/analysis/AnalysisFactoryTestCase.java
ywelsch pushed a commit to ywelsch/elasticsearch that referenced this pull request May 23, 2018
It is possible for state documents to be
left behind in the state index. This may be
because of bugs or uncontrollable scenarios.
In any case, those documents may take up quite
some disk space when they add up. This commit
adds a step in the expired data deletion that
is part of the daily maintenance service. The
new step searches for state documents that
do not belong to any of the current jobs and
deletes them.

Closes elastic#30551
@jpountz jpountz added v6.3.0 and removed v6.3.1 labels Jun 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants