Only clean up ES indices when single indexer running #658
Conversation
If more than one indexing process is happening, cleaning up unaliased indices is not a safe operation. There are several different ways the threads can conflict!

Much of the problem is caused by the fact that each "version" is a collection of 5 indices {content, content_error, metadata, published_unit, unit}, each of which must get an alias. Each of these indices is created, and then aliased, separately.

1) Job A is in the process of creating a new version's collection of ES indices; none of them have aliases yet. Job B concurrently reaches the cleanup stage, finds the indices Job A is adding data to, sees they are unaliased, and deletes them. This causes Job A to fail, but this failure should be noticed and fail safe(ish).

2) Job A has created all of the collection of indices for a new version, and is starting to move the various alias pointers. Job B reaches the cleanup stage, finds some mixture of two versions' indices without aliases, and deletes them. This may cause Job A to fail, but will also leave the not-yet-fully-aliased version in an invalid state; this version is being used by the API and the live site is broken.

This code now uses a thread-safe counter to skip running the deletion if another indexing job is running. However, this operation is not itself thread safe; some time may pass between the check and the response from ElasticSearch listing the indices and their aliases, in which time a new indexing job may have started creating indices. So, after receiving the index list, we check again that there are still no other concurrent jobs. This should catch all plausible cases.

Without truly serialising the whole ETL mechanics, I do not think we can make it perfect - but I believe this is a net improvement.
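The double-checked counter described above can be sketched as follows. This is illustrative only: the class and method names (IndexCleanupGuard, runIndexJob, maybeCleanUp) are hypothetical, not the actual identifiers in ElasticSearchIndexer.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the double-checked counter guard described above.
class IndexCleanupGuard {
    private final AtomicInteger runningIndexJobs = new AtomicInteger(0);

    /** Wraps one indexing job so concurrent jobs are counted. */
    void runIndexJob(Runnable indexWork) {
        runningIndexJobs.incrementAndGet();
        try {
            indexWork.run();
        } finally {
            runningIndexJobs.decrementAndGet();
        }
        maybeCleanUp();
    }

    /** Returns true if the cleanup was allowed to proceed. */
    boolean maybeCleanUp() {
        // First check: skip entirely if another indexing job is running.
        if (runningIndexJobs.get() > 0) {
            return false;
        }
        // The (potentially slow) request to ElasticSearch listing indices
        // and their aliases would happen here; a new job may start meanwhile.
        // Second check: re-verify before deleting anything.
        if (runningIndexJobs.get() > 0) {
            return false;
        }
        // deleteUnaliasedIndices();  // hypothetical deletion call
        return true;
    }
}
```

The second check narrows, but cannot fully close, the window between listing the indices and deleting them, which is exactly the residual race the description acknowledges.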
Codecov Report

@@ Coverage Diff @@
##           master     #658      +/-   ##
==========================================
+ Coverage   34.39%   34.42%   +0.02%
==========================================
  Files         521      521
  Lines       23378    23426      +48
  Branches     2863     2870       +7
==========================================
+ Hits         8042     8065      +23
- Misses      14528    14549      +21
- Partials      808      812       +4
src/main/java/uk/ac/cam/cl/dtg/segue/etl/ElasticSearchIndexer.java (5 dismissed check annotations)
If I'm understanding this, we don't mind whether the indexing step happens concurrently across multiple threads, but we want to stop any deletion step happening while others are indexing. If so, I think we can (mis?)use ReadWriteLock for this and it will be a bit simpler. Read locks are not mutually exclusive, so can be used for the indexing step. Write locks can't be acquired while any read or write locks are held, so can be used for the deletion step.
tryLock() for the write lock means the thread won't wait if the lock can't be acquired; it will just continue.
For example:
private static ReadWriteLock indexLock = new ReentrantReadWriteLock();
(...)
void setNamedVersion(final String alias, final String version) throws Exception {
    log.info("Requested new aliased version: {} - {}", alias, version);
    try {
        indexLock.readLock().lock();
        indexer.loadAndIndexContent(version);
        log.info("Indexed version {}. Setting alias '{}'.", version, alias);
        indexer.setNamedVersion(alias, version);
        // Store the alias to file so that we can recover after wiping ElasticSearch.
        this.contentIndicesStore.saveProperty(alias, version);
    } finally {
        indexLock.readLock().unlock();
    }
    if (indexLock.writeLock().tryLock()) {
        indexer.deleteAllUnaliasedIndices();
        indexLock.writeLock().unlock();
    }
}
This does look much more like what I wanted initially: a primitive that maps to the use case! It didn't occur to me that the writing here maps to reading, and the deletion maps to writing, in a ReadWriteLock. Since the deletion is the most dangerous step, I guess we also need to release the write lock in a try-finally too, right?
On reflection yes, no harm in doing so, and we wouldn't want the write lock to be held forever if something went wrong!
This is basically a move from an optimistic try-and-abort-if-necessary, to a pessimistic block-or-skip-if-necessary. But the use of a primitive feels nicer than the previous DIY job!
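Putting the review suggestion together with the try-finally point from the exchange above, the skip-if-busy deletion step might look like this. It is a sketch only: indexLock and deleteAllUnaliasedIndices follow the snippet in the review, while the wrapping class and method are illustrative.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative wrapper around the reviewer's write-lock suggestion,
// with the unlock moved into a finally block as discussed above.
class IndexCleanupExample {
    private static final ReadWriteLock indexLock = new ReentrantReadWriteLock();

    /** Returns true if the deletion step actually ran. */
    boolean cleanUpIfSoleIndexer() {
        // tryLock() fails immediately (no blocking) while any indexing
        // thread holds a read lock, so we simply skip the cleanup.
        if (!indexLock.writeLock().tryLock()) {
            return false;
        }
        try {
            // indexer.deleteAllUnaliasedIndices();  // the dangerous step
            return true;
        } finally {
            // Release in finally so a failed deletion can't hold the
            // write lock forever and block all future indexing.
            indexLock.writeLock().unlock();
        }
    }
}
```

Compared with the counter approach, a failed tryLock here plays the role of the counter being non-zero, but the check-then-delete window is closed: no indexing thread can acquire the read lock while the write lock is held.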
The first commit contains all the logic changes, later commits clean up some logging issues spotted during the work.