This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Add result indices retention period #174

Merged
merged 3 commits on Jun 29, 2020

Conversation

kaituo
Member

@kaituo kaituo commented Jun 24, 2020

Issue #, if available:
#37

Description of changes:
Currently we never delete result indices, even after customers have deleted the detector. A growing number of result indices can consume significant disk space and create memory pressure due to the accumulation of rolled-over indices. This PR adds a retention period for anomaly results: we delete result indices once they are older than the retention period, which defaults to 90 days. We use 90 days because that is the maximum number of days for which we let users view results in Kibana. Users can configure the retention period dynamically via the setting opendistro.anomaly_detection.ad_result_history_retention_period.
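A minimal sketch of the expiry check described above, assuming the creation time comes from the index metadata and the retention period from the dynamic setting (class and method names here are illustrative, not the plugin's actual helpers):

```java
import java.time.Duration;
import java.time.Instant;

public class RetentionCheck {
    // Default mirrors the PR's 90-day default for
    // opendistro.anomaly_detection.ad_result_history_retention_period
    static final Duration DEFAULT_RETENTION = Duration.ofDays(90);

    // An index is a deletion candidate once its age exceeds the retention period.
    static boolean isExpired(long creationTimeMillis, long nowMillis, long retentionMillis) {
        return nowMillis - creationTimeMillis > retentionMillis;
    }

    public static void main(String[] args) {
        long now = Instant.now().toEpochMilli();
        long created91DaysAgo = now - Duration.ofDays(91).toMillis();
        long created10DaysAgo = now - Duration.ofDays(10).toMillis();
        System.out.println(isExpired(created91DaysAgo, now, DEFAULT_RETENTION.toMillis())); // true
        System.out.println(isExpired(created10DaysAgo, now, DEFAULT_RETENTION.toMillis())); // false
    }
}
```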

Also, we previously rolled over empty result indices. This PR fixes that by removing the max-age condition from the rollover request, so we only roll over the result index when its maximum document count is reached.
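The rollover-condition change can be illustrated with a simplified sketch (this stands in for the Elasticsearch rollover conditions, it is not the actual rollover API; `maxDocs` represents the plugin's configured document limit):

```java
public class RolloverCondition {
    // Before: rolled over when EITHER max age or max docs was reached,
    // which rolled over empty indices purely because they were old.
    static boolean shouldRollOverOld(long docCount, long maxDocs, long ageMillis, long maxAgeMillis) {
        return docCount >= maxDocs || ageMillis >= maxAgeMillis;
    }

    // After: only the document-count condition remains, so an empty
    // index is never rolled over, no matter how old it is.
    static boolean shouldRollOverNew(long docCount, long maxDocs) {
        return docCount >= maxDocs;
    }
}
```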

Testing done:

  • manually tested that result indices are deleted once they pass the retention period.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@ylwu-amzn ylwu-amzn added the enhancement New feature or request label Jun 24, 2020
deleteOldHistoryIndices();
}
}, exception -> {
logger.error("Fail to roll over result index");
Contributor

Why split this into two lines? How about using logger.error("Fail to roll over result index", exception); instead? Then we don't have to read two log lines when checking the log. Log lines from other requests may be interleaved between these two, so when checking the log manually we would have to skip past them.

Member Author

good catch. Fixed.


if (candidates.size() > 1) {
// delete all indices except the last one because the last one may contain docs newer than the retention period
candidates.remove(latestToDelete);
Contributor

@ylwu-amzn ylwu-amzn Jun 24, 2020

How about we just don't add latestToDelete to candidates in the for loop above?

Member Author

How would we do that? We don't know latestToDelete before looping through all the indices.

Contributor

Can we iterate over the indices in reverse and skip the first index that meets the condition (Instant.now().toEpochMilli() - creationTime) > historyRetentionPeriod.millis()?

This is a minor suggestion. Ignore it if it's hard/impossible to do so.

Member Author

The indices have no inherent order, so we cannot do that.
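The approach the thread lands on — collect all expired indices while tracking the newest one, then drop that newest index from the delete set — can be sketched like this (a simplified model, with map keys standing in for index names and values for creation times):

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RetentionPrune {
    // Given unordered expired indices (name -> creation time in millis),
    // remove the most recently created one from the candidates, because it
    // may still contain documents newer than the retention period.
    static Set<String> pruneLatest(Map<String, Long> expired) {
        Set<String> candidates = new HashSet<>(expired.keySet());
        if (candidates.size() > 1) {
            String latestToDelete = Collections
                .max(expired.entrySet(), Map.Entry.comparingByValue())
                .getKey();
            candidates.remove(latestToDelete);
        }
        return candidates;
    }
}
```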

}
}, exception -> { deleteIndexIteration(toDelete); }));
}
}, exception -> { logger.error("Fail to get creation dates of result indices"); }));
Contributor

Is this exception handler catching failures from clusterStateRequest? I don't quite get why we log Fail to get creation dates here. Is getting the creation dates the only possible cause of the exception?

Member Author

good catch. Changed to more general error message.

if (exception instanceof IndexNotFoundException) {
logger.info("{} was already deleted.", index);
} else {
logger.error("Retrying deleting {} does not succeed.", index);
Contributor

Same here: why split this into two lines?

Member Author

Fixed

"Could not delete one or more Anomaly result indices: {}. Retrying one by one.",
Arrays.toString(toDelete)
);
deleteIndexIteration(toDelete);
Contributor

@ylwu-amzn ylwu-amzn Jun 24, 2020

Is a second round of deletion necessary? Rollover is triggered periodically, so the next run can re-delete any indices that failed.
If we fail to delete the indices due to some error, will re-deleting them immediately help? The error may still be present.

Member Author

By default, we would need to wait 12 hours before the next deletion attempt. Adding a retry can mitigate disk/memory pressure without waiting that long. This is best effort and we cannot guarantee it will help. We don't retry endlessly, just once.
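The best-effort, retry-once behavior described above could be sketched as follows; the `Deleter` interface is a hypothetical stand-in for the plugin's delete call, modeled here as returning the indices that failed:

```java
import java.util.ArrayList;
import java.util.List;

public class DeleteRetry {
    // Hypothetical stand-in for the bulk delete call; returns indices that failed.
    interface Deleter {
        List<String> delete(List<String> indices);
    }

    // Attempt the bulk delete once; for any index that fails, retry it
    // individually exactly once. Returns the indices that still failed.
    static List<String> deleteWithOneRetry(Deleter deleter, List<String> toDelete) {
        List<String> failed = deleter.delete(toDelete);
        List<String> stillFailed = new ArrayList<>();
        for (String index : failed) {
            if (!deleter.delete(List.of(index)).isEmpty()) {
                stillFailed.add(index);
            }
        }
        return stillFailed;
    }
}
```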

Contributor

Can we find out which indices failed to delete? Could we then delete those failed indices in one call?

Member Author

I don't know how to. The delete response is of type AcknowledgedResponse, which contains only the isAcknowledged field.

);
deleteIndexIteration(toDelete);
} else {
logger.error("Succeeded in deleting expired anomaly result indices: {}.", Arrays.toString(toDelete));
Contributor

logger.info

Member Author

good catch. Fixed.

} else {
logger.error("Succeeded in deleting expired anomaly result indices: {}.", Arrays.toString(toDelete));
}
}, exception -> { deleteIndexIteration(toDelete); }));
Contributor

maybe we can log the exception before retrying

Member Author

added

Contributor

@yizheliu-amazon yizheliu-amazon left a comment

Two minor comments; we may add UT or IT in the future.

@kaituo
Member Author

kaituo commented Jun 29, 2020

Two minor comments; we may add UT or IT in the future.

added UT.

@kaituo kaituo merged commit 0f53845 into opendistro-for-elasticsearch:master Jun 29, 2020
yizheliu-amazon pushed a commit that referenced this pull request Aug 28, 2020
* Add result indices retention period