This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Add multi-entity checkpoints read and write #256

Closed
wants to merge 5 commits into from

@kaituo (Member) commented Oct 14, 2020

Note: Because this change has many dependencies, I list only the main class and its test code to save reviewers' time. The build will fail because of the missing dependencies. This PR is for review only and will not be merged. I will open one combined PR at the end and merge it once all the review PRs are approved.

Issue #, if available:

Description of changes:

We need checkpoints to save states and models on disk. In single-entity detectors, we store rcf and threshold models separately in different docs. In multi-entity detectors, we need to store them together as we don't use distributed models anymore. We also need to store recent sample history when the models are not ready.

This PR adds functions to serialize models and samples together in one doc and to deserialize them when needed. It also bulk indexes multi-entity detectors' checkpoints, because bulk requests yield much better performance than single-document index requests. In addition, I add a detectorId field to the checkpoint index so that checkpoints can be queried by detector id.
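
As a rough illustration of the batching idea described above, here is a minimal sketch in plain Java. It is not the PR's code: `CheckpointBatcher`, `toDoc`, the `models` and `samples` field names, and the `bulkSender` callback are all hypothetical stand-ins, and only the `detectorId` field name comes from the description. The sketch collects one document per entity and hands the documents to a sender in batches instead of one at a time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch: buffers checkpoint docs and sends them in batches.
final class CheckpointBatcher {
    private final int maxBatchSize;
    // Stand-in for whatever actually issues the bulk request (assumption).
    private final Consumer<List<Map<String, Object>>> bulkSender;
    private final List<Map<String, Object>> pending = new ArrayList<>();

    CheckpointBatcher(int maxBatchSize, Consumer<List<Map<String, Object>>> bulkSender) {
        this.maxBatchSize = maxBatchSize;
        this.bulkSender = bulkSender;
    }

    // One doc holds the serialized models and the recent samples together.
    // "models" and "samples" are assumed field names; "detectorId" is from the PR text.
    static Map<String, Object> toDoc(String detectorId, String serializedModels, List<double[]> samples) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("detectorId", detectorId);
        doc.put("models", serializedModels);
        doc.put("samples", samples);
        return doc;
    }

    // Queue a doc and send the batch once it reaches the size limit.
    void write(Map<String, Object> doc) {
        pending.add(doc);
        if (pending.size() >= maxBatchSize) {
            flush();
        }
    }

    // Send whatever is queued as a single batch, then clear the queue.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        bulkSender.accept(new ArrayList<>(pending));
        pending.clear();
    }
}
```

A caller would invoke `write` for each entity's checkpoint and call `flush` once at the end to send any remainder.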

Testing done:

  1. added unit tests.
  2. end-to-end testing

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.


codecov bot commented Oct 14, 2020

Codecov Report

Merging #256 into master will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##             master     #256   +/-   ##
=========================================
  Coverage     73.01%   73.01%           
  Complexity     1461     1461           
=========================================
  Files           164      164           
  Lines          6834     6834           
  Branches        527      527           
=========================================
  Hits           4990     4990           
  Misses         1594     1594           
  Partials        250      250           
| Flag | Coverage Δ | Complexity Δ |
| --- | --- | --- |
| #cli | 79.27% <ø> (ø) | 0.00 <ø> (ø) |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | Complexity Δ |
| --- | --- | --- |
| ...pendistroforelasticsearch/ad/ml/CheckpointDao.java | 100.00% <100.00%> (ø) | 14.00 <9.00> (ø) |

Comment on lines 171 to 189
    if (indexUtil.doesCheckpointIndexExist()) {
        saveModelCheckpointSync(source, modelId);
    } else {
        indexUtil.initCheckpointIndex(ActionListener.wrap(initResponse -> {
            if (initResponse.isAcknowledged()) {
                saveModelCheckpointSync(source, modelId);
            } else {
                throw new RuntimeException("Creating checkpoint with mappings call not acknowledged.");
            }
        }, exception -> {
            if (ExceptionsHelper.unwrapCause(exception) instanceof ResourceAlreadyExistsException) {
                // It is possible the index has been created while we sending the create request
                saveModelCheckpointSync(source, modelId);
            } else {
                logger.error(String.format("Unexpected error creating index %s", indexName), exception);
            }
        }));
    }
}
Contributor

This code looks the same as the block below. You could refactor both into a single method to avoid the duplication.

Member Author

@kaituo kaituo Oct 14, 2020

I tried but ended up giving up. The difference is the save method: one path calls saveModelCheckpointSync with two parameters, while the other calls saveModelCheckpointAsync with three parameters. I created a functional interface that consumes three parameters (since the JDK does not provide one) and wrote a generic method that uses it. I gave up because the interface and the method are used only once inside the class, and the result was more code overall.
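
For readers unfamiliar with the approach described in this comment, the following is a minimal sketch of a three-argument functional interface and a generic method built on it. It is not the code the author wrote and then dropped. `TriConsumer`, `TriConsumerSketch`, `saveSync`, and `saveAsync` are hypothetical names, and `Runnable` stands in for the real listener type.

```java
import java.util.ArrayList;
import java.util.List;

// The JDK offers Consumer and BiConsumer but no three-argument variant,
// so a custom interface is needed (hypothetical name).
@FunctionalInterface
interface TriConsumer<A, B, C> {
    void accept(A a, B b, C c);
}

final class TriConsumerSketch {
    // Records which save path ran, for illustration only.
    final List<String> calls = new ArrayList<>();

    // Two-parameter save path (stand-in for the sync method).
    void saveSync(String source, String modelId) {
        calls.add("sync:" + modelId);
    }

    // Three-parameter save path (stand-in for the async method).
    void saveAsync(String source, String modelId, Runnable listener) {
        calls.add("async:" + modelId);
        listener.run();
    }

    // Generic method: the shared index-exists logic would live here once,
    // and the save strategy is passed in by the caller.
    void write(TriConsumer<String, String, Runnable> save, String source, String modelId, Runnable listener) {
        save.accept(source, modelId, listener);
    }
}
```

The sync path has to be adapted with a lambda that ignores the third argument, for example `dao.write((s, id, l) -> dao.saveSync(s, id), source, modelId, null)`, while the async path can be passed as `dao::saveAsync`. That extra interface and adapter for a single call site is the overhead the comment refers to.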

Contributor

What if we had a method called saveModelCheckpoint(source, modelId, isAsync, listener) that uses isAsync to decide whether to call saveModelCheckpointSync or saveModelCheckpointAsync?

Member Author

good idea. Changed.
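
The suggestion accepted above can be sketched roughly as follows. This is an illustration under stated assumptions, not the change that was actually committed: `CheckpointWriterSketch` is a hypothetical stand-in for the DAO, the method bodies only record which path ran, `Runnable` replaces the real listener type, and index creation is simulated synchronously.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the DAO; bodies only record which path ran.
final class CheckpointWriterSketch {
    final List<String> calls = new ArrayList<>();
    private boolean indexExists;

    CheckpointWriterSketch(boolean indexExists) {
        this.indexExists = indexExists;
    }

    // Shared entry point: the index check is written once for both modes.
    void write(String source, String modelId, boolean isAsync, Runnable listener) {
        Runnable save = () -> saveModelCheckpoint(source, modelId, isAsync, listener);
        if (indexExists) {
            save.run();
        } else {
            initCheckpointIndex(save);
        }
    }

    // The flag picks the save variant, as proposed in the review comment.
    void saveModelCheckpoint(String source, String modelId, boolean isAsync, Runnable listener) {
        if (isAsync) {
            saveModelCheckpointAsync(source, modelId, listener);
        } else {
            saveModelCheckpointSync(source, modelId);
        }
    }

    // Simulated index creation that then runs the pending save (assumption).
    private void initCheckpointIndex(Runnable onCreated) {
        calls.add("init");
        indexExists = true;
        onCreated.run();
    }

    private void saveModelCheckpointSync(String source, String modelId) {
        calls.add("sync:" + modelId);
    }

    private void saveModelCheckpointAsync(String source, String modelId, Runnable listener) {
        calls.add("async:" + modelId);
        listener.run();
    }
}
```

Compared with a custom three-argument functional interface, the boolean flag keeps the dispatch in one small method at the cost of a parameter that the sync path ignores.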

        // It is possible the index has been created while we sending the create request
        flush(bulkRequest);
    } else {
        logger.error(String.format("Unexpected error creating index %s", indexName), exception);
Contributor

In which cases do we call the write() method of checkpointDao? Without knowing that, I am not sure whether it is okay to swallow the exception here.

Member Author

We call it whenever we need to save a checkpoint. Is there a better option that avoids swallowing the exception?

Contributor

@yizheliu-amazon yizheliu-amazon Oct 15, 2020

What if we throw the unexpected exception? Would it break anything or change the existing workflow?

Member Author

Not completely sure without testing. I guess the upstream code will eventually swallow it.

Contributor

i see. thanks.
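
The thread above ends without a code change. One alternative to log-only handling, which nobody in the thread proposed or tested, would be to forward unexpected failures to a caller-supplied callback so that the caller decides what to do with them. The sketch below is purely hypothetical: `IndexCreationFailureSketch`, `AlreadyExistsException` (a stand-in for ResourceAlreadyExistsException), and the `onFailure` callback are invented for illustration.

```java
import java.util.function.Consumer;

final class IndexCreationFailureSketch {
    // Hypothetical stand-in for ResourceAlreadyExistsException.
    static final class AlreadyExistsException extends RuntimeException {
        AlreadyExistsException(String message) {
            super(message);
        }
    }

    // A benign race (index already created) proceeds with the flush;
    // anything else is handed to the caller instead of only being logged.
    static void onIndexCreationFailure(Exception exception, Runnable flush, Consumer<Exception> onFailure) {
        if (exception instanceof AlreadyExistsException) {
            flush.run();
        } else {
            onFailure.accept(exception);
        }
    }
}
```

Whether this would be safe in the existing workflow is exactly the open question in the thread, so it would need the testing the author mentions.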

kaituo added a commit that referenced this pull request Oct 16, 2020
* Add support filtering the data by one categorical variable

This PR is a conglomerate of the following PRs.

#247
#249
#250
#252
#253
#256
#257
#258
#259
#260
#261
#262
#263
#264
#265
#266
#267
#268
#269

This spreadsheet contains the mappings from files to PR number: https://quip-amazon.com/DiHkAmz9oSLu/HC-PR

Testing done:
1. Add unit tests except four classes (excluded in build.gradle). Will add them in the later PR.
2. Manual testing passes.
@kaituo kaituo closed this Oct 16, 2020