Only clean up ES indices when single indexer running #658

Merged: 4 commits into master, Nov 21, 2024

Conversation

jsharkey13 (Member)

If more than one indexing process is happening, cleaning up unaliased indices is not a safe operation. There are several different ways the threads can conflict!
Much of the problem is caused by the fact that each "version" is a collection of 5 indices {content, content_error, metadata, published_unit, unit}, each of which must get an alias. Each of these indices is created, and then aliased, separately.

  1. Job A is in the process of creating a new version's collection of ES indices; none of them have aliases yet. Job B concurrently reaches the cleanup stage, finds the indices Job A is adding data to, sees they are unaliased and deletes them. This causes Job A to fail, but this failure should be noticed and fail safe(ish).

  2. Job A has created all of the collection of indices for a new version, and is starting to move the various alias pointers. Job B reaches the cleanup stage, finds some mixture of the two versions' indices without aliases and deletes them. This may cause Job A to fail, but it will also leave the not-yet-aliased-as-previous version in an invalid state; this version is still being used by the API, so the live site is broken.

This code now uses a thread-safe counter to skip running the deletion if another indexing job is running. However, this check is not itself thread safe: some time may pass between the check and the response from ElasticSearch listing the indices and their aliases, during which a new indexing job may have started creating indices. So, once the index-list response arrives, we check again that there are still no other concurrent jobs. This should catch all plausible cases.
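Roughly, the check-recheck shape is something like the sketch below (illustrative only: the counter field and the two index-listing/deleting helper names are placeholders rather than the actual ETLManager code; assumes java.util.concurrent.atomic.AtomicInteger and java.util.List are imported):

    // Incremented when an indexing job starts, decremented when it finishes.
    private final AtomicInteger runningIndexJobs = new AtomicInteger(0);

    void indexThenMaybeCleanUp(final String alias, final String version) throws Exception {
        runningIndexJobs.incrementAndGet();
        try {
            indexer.loadAndIndexContent(version);
            indexer.setNamedVersion(alias, version);
        } finally {
            runningIndexJobs.decrementAndGet();
        }

        // First check: skip the cleanup entirely if another job is still indexing.
        if (runningIndexJobs.get() > 0) {
            return;
        }
        // Listing the indices and their aliases is a round-trip to ElasticSearch and may be slow...
        List<String> unaliasedIndices = indexer.listUnaliasedIndices();
        // ...so re-check before deleting, in case a new job started in the meantime.
        if (runningIndexJobs.get() == 0) {
            indexer.deleteIndices(unaliasedIndices);
        }
    }

There is still a small window between the second check and the deletion, which is exactly the residual imperfection acknowledged below.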

Without truly serialising the whole ETL mechanics, I do not think we can make it perfect - but I believe this is a net improvement.


The first commit contains all the logic changes, later commits clean up some logging issues spotted during the work.


codecov bot commented Nov 15, 2024

Codecov Report

Attention: Patch coverage is 0% with 32 lines in your changes missing coverage. Please review.

Project coverage is 34.42%. Comparing base (cdbb748) to head (37cbf43).
Report is 21 commits behind head on master.

Files with missing lines | Patch % | Lines
...in/java/uk/ac/cam/cl/dtg/segue/etl/ETLManager.java | 0.00% | 18 Missing ⚠️
.../ac/cam/cl/dtg/segue/etl/ElasticSearchIndexer.java | 0.00% | 11 Missing ⚠️
...ava/uk/ac/cam/cl/dtg/segue/etl/ContentIndexer.java | 0.00% | 2 Missing ⚠️
...ain/java/uk/ac/cam/cl/dtg/segue/etl/ETLFacade.java | 0.00% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #658      +/-   ##
==========================================
+ Coverage   34.39%   34.42%   +0.02%     
==========================================
  Files         521      521              
  Lines       23378    23426      +48     
  Branches     2863     2870       +7     
==========================================
+ Hits         8042     8065      +23     
- Misses      14528    14549      +21     
- Partials      808      812       +4     

jsharkey13 requested a review from mwtrew on November 18, 2024 at 09:49
mwtrew (Contributor) left a comment


If I'm understanding this, we don't mind whether the indexing step happens concurrently across multiple threads, but we want to stop any deletion step happening while others are indexing. If so, I think we can (mis?)use ReadWriteLock for this and it will be a bit simpler. Read locks are not mutually exclusive, so can be used for the indexing step. Write locks can't be acquired while any read or write locks are held, so can be used for the deletion step.

tryLock() on the write lock means the thread won't wait if the lock can't be acquired; it will just continue.

For example:

    private static ReadWriteLock indexLock = new ReentrantReadWriteLock();

(...)

    void setNamedVersion(final String alias, final String version) throws Exception {
        log.info("Requested new aliased version: {} - {}", alias, version);

        // Acquire the read lock before the try, so the finally only releases a lock we actually hold.
        indexLock.readLock().lock();
        try {
            indexer.loadAndIndexContent(version);
            log.info("Indexed version {}. Setting alias '{}'.", version, alias);
            indexer.setNamedVersion(alias, version);
            // Store the alias to file so that we can recover after wiping ElasticSearch.
            this.contentIndicesStore.saveProperty(alias, version);
        } finally {
            indexLock.readLock().unlock();
        }

        if (indexLock.writeLock().tryLock()) {
            indexer.deleteAllUnaliasedIndices();
            indexLock.writeLock().unlock();
        }
    }

jsharkey13 (Member, Author)

At first glance, this does look more like what I wanted: a primitive that maps to the use case! It hadn't occurred to me that the indexing (writing data) here maps to reading, and the deletion maps to writing, in a ReadWriteLock.

The most dangerous case is when we are inside deleteAllUnaliasedIndices(): no new thread must be able to start indexing whilst we're in there, and it looks like this achieves that.

I guess we also need to release the write lock in a try-finally too, right?

mwtrew (Contributor) commented Nov 21, 2024

I guess we also need to release the write lock in a try-finally too, right?

On reflection, yes: there's no harm in doing so, and we wouldn't want the write lock to be held forever if something went wrong!
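For concreteness, a minimal sketch of the cleanup step with the write lock released in a try/finally, reusing the names from the example above (illustrative only, not necessarily the exact merged code):

        // Skip the cleanup entirely if any indexing job currently holds the read lock.
        if (indexLock.writeLock().tryLock()) {
            try {
                indexer.deleteAllUnaliasedIndices();
            } finally {
                // Always release the write lock, even if the deletion throws.
                indexLock.writeLock().unlock();
            }
        }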

This is basically a move from an optimistic try-and-abort-if-necessary to a pessimistic block-or-skip-if-necessary. But the use of a primitive feels nicer than the previous DIY job!
jsharkey13 requested a review from mwtrew on November 21, 2024 at 10:56
mwtrew merged commit b915f20 into master on Nov 21, 2024
3 checks passed
mwtrew deleted the hotfix/improve-etl-threadsafety branch on November 21, 2024 at 13:22