Conversation
Currently we never delete result indices, even after customers have deleted the detector. A growing number of result indices can consume significant disk space and create memory pressure from the rolled-over indices. This PR adds a retention period to anomaly results: we delete result indices once they are older than the retention period, which is 90 days by default. We use 90 days because that is the maximum number of days for which we let users view results on Kibana. Users can configure the retention period dynamically via the setting opendistro.anomaly_detection.ad_result_history_retention_period.

Also, previously we would roll over empty result indices. This PR fixes that by removing the max-age condition on result indices, so we only roll over the result index when the maximum number of documents in the index is reached.

Testing done:
* Manually tested that result indices are deleted once they pass the retention period.
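For context, a minimal sketch of how such a dynamic retention setting can be declared in an Elasticsearch plugin. The setting name and the 90-day default come from this PR; the class name and surrounding code are assumptions, not the PR's exact implementation:

import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.unit.TimeValue;

public class AnomalyDetectorSettings {
    // Dynamic, node-scope setting; 90 days matches the default described above.
    public static final Setting<TimeValue> AD_RESULT_HISTORY_RETENTION_PERIOD = Setting
        .positiveTimeSetting(
            "opendistro.anomaly_detection.ad_result_history_retention_period",
            TimeValue.timeValueDays(90),
            Setting.Property.NodeScope,
            Setting.Property.Dynamic
        );
}

The max-docs-only rollover described above would then amount to registering only a max-docs condition on the rollover request (e.g. RolloverRequest.addMaxIndexDocsCondition) and no max-age condition.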
        deleteOldHistoryIndices();
    }
}, exception -> {
    logger.error("Fail to roll over result index");
Why split this into two lines? How about we use logger.error("Fail to roll over result index", exception); instead? That way we don't have to read two lines when checking the log. There may be other requests' log entries between these two lines, so when checking the log manually we would have to skip over them.
Good catch. Fixed.
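The one-line form passes the throwable to the same call, so the stack trace stays attached to its message in the log (a sketch of the fix, using Log4j's error(String, Throwable) overload):

logger.error("Fail to roll over result index", exception);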
if (candidates.size() > 1) {
    // delete all indices except the last one because the last one may contain docs newer than the retention period
    candidates.remove(latestToDelete);
How about we just don't add latestToDelete to candidates in the for loop above?
How would we do that? We don't know latestToDelete until we have looped through all the indices.
Can we iterate over the indices in reverse and skip the first index that meets the condition (Instant.now().toEpochMilli() - creationTime) > historyRetentionPeriod.millis()?
This is a minor suggestion; ignore it if it's hard or impossible to do.
The indices have no order, so we cannot do that.
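To make the trade-off concrete, here is a hypothetical sketch of the pruning pass under discussion. indexCreationTimes (a map from result-index name to creation time in epoch millis) is an assumed stand-in for the creation dates read from cluster state:

import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Inside the deletion routine; historyRetentionPeriod is the dynamic setting.
List<String> candidates = new ArrayList<>();
long latestCreation = Long.MIN_VALUE;
String latestToDelete = null;
for (Map.Entry<String, Long> entry : indexCreationTimes.entrySet()) {
    long creationTime = entry.getValue();
    if (Instant.now().toEpochMilli() - creationTime > historyRetentionPeriod.millis()) {
        candidates.add(entry.getKey());
        if (creationTime > latestCreation) {
            // The newest expired index is only known after the full pass.
            latestCreation = creationTime;
            latestToDelete = entry.getKey();
        }
    }
}
if (candidates.size() > 1) {
    // The newest expired index may still contain docs inside the retention
    // window, so keep it and delete the rest.
    candidates.remove(latestToDelete);
}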
    }
}, exception -> { deleteIndexIteration(toDelete); }));
}
}, exception -> { logger.error("Fail to get creation dates of result indices"); }));
Is this handler catching exceptions from clusterStateRequest? I don't quite get why we log "Fail to get creation dates" here. Can the exception only be caused by getting creation dates?
Good catch. Changed to a more general error message.
if (exception instanceof IndexNotFoundException) {
    logger.info("{} was already deleted.", index);
} else {
    logger.error("Retrying deleting {} does not succeed.", index);
Same here: why split this into two lines?
Fixed
"Could not delete one or more Anomaly result indices: {}. Retrying one by one.", | ||
Arrays.toString(toDelete) | ||
); | ||
deleteIndexIteration(toDelete); |
Is it necessary to use two rounds of deletion? Rollover is triggered periodically, so the next run can re-delete the failed indices.
If we fail to delete all indices due to some error, will deleting them again immediately help? The error may still be there.
By default we would have to wait 12 hours before re-deleting. Adding a retry can mitigate some disk/memory issues without waiting that long. This is best effort and we cannot guarantee it will help. We don't retry endlessly, just once.
Can we find out which indices failed to delete? Can we delete those failed indices in one call?
I don't know how to. The delete response is of type AcknowledgedResponse, which contains only the isAcknowledged field.
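For reference, a minimal sketch of what the one-shot, per-index retry could look like. adminClient and the exact messages are assumptions; the error handling mirrors the diff above:

import org.apache.logging.log4j.message.ParameterizedMessage;
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.admin.indices.delete.DeleteIndexRequest;
import org.elasticsearch.index.IndexNotFoundException;

private void deleteIndexIteration(String[] toDelete) {
    for (String index : toDelete) {
        // One request per index, so a single failure cannot mask the others.
        adminClient.indices().delete(new DeleteIndexRequest(index), ActionListener.wrap(response -> {
            if (!response.isAcknowledged()) {
                logger.error("Retrying deleting {} does not succeed.", index);
            }
        }, exception -> {
            if (exception instanceof IndexNotFoundException) {
                logger.info("{} was already deleted.", index);
            } else {
                logger.error(new ParameterizedMessage("Retrying deleting {} does not succeed.", index), exception);
            }
        }));
    }
}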
);
deleteIndexIteration(toDelete);
} else {
    logger.error("Succeeded in deleting expired anomaly result indices: {}.", Arrays.toString(toDelete));
This should be logger.info.
Good catch. Fixed.
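Presumably the fix just swaps the log level on the success branch (a sketch):

logger.info("Succeeded in deleting expired anomaly result indices: {}.", Arrays.toString(toDelete));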
} else {
    logger.error("Succeeded in deleting expired anomaly result indices: {}.", Arrays.toString(toDelete));
}
}, exception -> { deleteIndexIteration(toDelete); }));
Maybe we can log the exception before retrying.
Added.
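The shape of the fix is presumably something like the following fragment of the failure handler, logging the cause before falling back to per-index deletes (the message wording is an assumption):

}, exception -> {
    // Record why the bulk delete failed before retrying one by one.
    logger.error("Fail to delete expired anomaly result indices", exception);
    deleteIndexIteration(toDelete);
}));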
Two minor comments. We may add a UT or IT in the future.
Added a UT.
Issue #, if available:
#37
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.