Conversation
Currently we never delete result indices, even after customers have deleted the detector. A growing number of result indices can consume significant disk space and create memory pressure from the rolled-over indices. This PR adds a retention period to anomaly results: we delete result indices once they are older than the retention period, which is 90 days by default. We use 90 days because that is the maximum number of days for which we let users view results on Kibana. Users can configure the retention period dynamically via the setting opendistro.anomaly_detection.ad_result_history_retention_period.

Also, previously we would roll over empty result indices. This PR fixes that by removing the max-age condition on result indices, so we only roll over the result index when the maximum number of documents in the index is reached.

Testing done:
* Manually tested that result indices are deleted once they pass the retention period.
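For context, a minimal sketch of how such a dynamic retention setting can be declared in an Elasticsearch plugin. The setting name and the 90-day default come from this PR; the class name and surrounding code are assumptions, not the PR's exact implementation:

import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.unit.TimeValue;

public class AnomalyDetectorSettings {
    // Dynamic, node-scope setting; 90 days matches the default described above.
    public static final Setting<TimeValue> AD_RESULT_HISTORY_RETENTION_PERIOD = Setting
        .positiveTimeSetting(
            "opendistro.anomaly_detection.ad_result_history_retention_period",
            TimeValue.timeValueDays(90),
            Setting.Property.NodeScope,
            Setting.Property.Dynamic
        );
}

The max-docs-only rollover described above would then amount to registering only a max-docs condition on the rollover request (e.g. RolloverRequest.addMaxIndexDocsCondition) and no max-age condition.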
        deleteOldHistoryIndices();
    }
}, exception -> {
    logger.error("Fail to roll over result index");
Why split this into two lines? How about we use logger.error("Fail to roll over result index", exception); instead? That way we don't have to read two lines when checking the log. There may be other requests' log entries between these two lines, so when checking the log manually we would have to skip over them.
Good catch. Fixed.
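The one-line form passes the throwable to the same call, so the stack trace stays attached to its message in the log (a sketch of the fix, using Log4j's error(String, Throwable) overload):

logger.error("Fail to roll over result index", exception);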
if (candidates.size() > 1) {
    // delete all indices except the last one because the last one may contain docs newer than the retention period
    candidates.remove(latestToDelete);
How about we just don't add latestToDelete to candidates in the for loop above?
How would we do that? We don't know latestToDelete until we have looped through all the indices.
Can we iterate over the indices in reverse and skip the first index that meets the condition (Instant.now().toEpochMilli() - creationTime) > historyRetentionPeriod.millis()?
This is a minor suggestion; ignore it if it's hard or impossible to do.
The indices have no order, so we cannot do that.
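To make the trade-off concrete, here is a hypothetical sketch of the pruning pass under discussion. indexCreationTimes (a map from result-index name to creation time in epoch millis) is an assumed stand-in for the creation dates read from cluster state:

import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Inside the deletion routine; historyRetentionPeriod is the dynamic setting.
List<String> candidates = new ArrayList<>();
long latestCreation = Long.MIN_VALUE;
String latestToDelete = null;
for (Map.Entry<String, Long> entry : indexCreationTimes.entrySet()) {
    long creationTime = entry.getValue();
    if (Instant.now().toEpochMilli() - creationTime > historyRetentionPeriod.millis()) {
        candidates.add(entry.getKey());
        if (creationTime > latestCreation) {
            // The newest expired index is only known after the full pass.
            latestCreation = creationTime;
            latestToDelete = entry.getKey();
        }
    }
}
if (candidates.size() > 1) {
    // The newest expired index may still contain docs inside the retention
    // window, so keep it and delete the rest.
    candidates.remove(latestToDelete);
}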
    }
}, exception -> { deleteIndexIteration(toDelete); }));
}
}, exception -> { logger.error("Fail to get creation dates of result indices"); }));
Is this handler catching exceptions from clusterStateRequest? I don't quite get why we log "Fail to get creation dates" here. Can the exception only be caused by getting creation dates?
Good catch. Changed to a more general error message.
if (exception instanceof IndexNotFoundException) {
    logger.info("{} was already deleted.", index);
} else {
    logger.error("Retrying deleting {} does not succeed.", index);
Same here: why split this into two lines?
Fixed
"Could not delete one or more Anomaly result indices: {}. Retrying one by one.", | ||
Arrays.toString(toDelete) | ||
); | ||
deleteIndexIteration(toDelete); |
Is it necessary to use two rounds of deletion? Rollover is triggered periodically, so the next run can re-delete the failed indices.
If we fail to delete all indices due to some error, will deleting them again immediately help? The error may still be there.
By default we would have to wait 12 hours before re-deleting. Adding a retry can mitigate some disk/memory issues without waiting that long. This is best effort and we cannot guarantee it will help. We don't retry endlessly, just once.
Can we find out which indices failed to delete? Can we delete those failed indices in one call?
I don't know how to. The delete response is of type AcknowledgedResponse, which contains only the isAcknowledged field.
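For reference, a minimal sketch of what the one-shot, per-index retry could look like. adminClient and the exact messages are assumptions; the error handling mirrors the diff above:

import org.apache.logging.log4j.message.ParameterizedMessage;
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.admin.indices.delete.DeleteIndexRequest;
import org.elasticsearch.index.IndexNotFoundException;

private void deleteIndexIteration(String[] toDelete) {
    for (String index : toDelete) {
        // One request per index, so a single failure cannot mask the others.
        adminClient.indices().delete(new DeleteIndexRequest(index), ActionListener.wrap(response -> {
            if (!response.isAcknowledged()) {
                logger.error("Retrying deleting {} does not succeed.", index);
            }
        }, exception -> {
            if (exception instanceof IndexNotFoundException) {
                logger.info("{} was already deleted.", index);
            } else {
                logger.error(new ParameterizedMessage("Retrying deleting {} does not succeed.", index), exception);
            }
        }));
    }
}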
);
deleteIndexIteration(toDelete);
} else {
    logger.error("Succeeded in deleting expired anomaly result indices: {}.", Arrays.toString(toDelete));
This should be logger.info.
Good catch. Fixed.
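Presumably the fix just swaps the log level on the success branch (a sketch):

logger.info("Succeeded in deleting expired anomaly result indices: {}.", Arrays.toString(toDelete));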
} else {
    logger.error("Succeeded in deleting expired anomaly result indices: {}.", Arrays.toString(toDelete));
}
}, exception -> { deleteIndexIteration(toDelete); }));
Maybe we can log the exception before retrying.
Added.
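The shape of the fix is presumably something like the following fragment of the failure handler, logging the cause before falling back to per-index deletes (the message wording is an assumption):

}, exception -> {
    // Record why the bulk delete failed before retrying one by one.
    logger.error("Fail to delete expired anomaly result indices", exception);
    deleteIndexIteration(toDelete);
}));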
Two minor comments. We may add a UT or IT in the future.
Added a UT.
Issue #, if available:
#37
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.